Data Pre-processing

Hui Lin

2019-02-12

Data can be dirty and usually is!

Data types

  1. Raw data
  2. Technically correct data
  3. Data that is proper for the model
  4. Summarized data
  5. Data with fixed format

Preprocessing Map

Data Cleaning

sim.dat <- read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
summary(sim.dat)

Data Cleaning

# set problematic values as missing
sim.dat$age[which(sim.dat$age>100)]<-NA
sim.dat$store_exp[which(sim.dat$store_exp<0)]<-NA
# see the results
summary(subset(sim.dat,select=c("age","store_exp")))

Missing Values

Missing Values: median/mode

# load the packages that provide the imputation functions
library(imputeMissings)
library(caret)
# save the result as another object
demo_imp<-impute(sim.dat,method="median/mode")
# check the first 5 columns; there are no missing values in the other columns
summary(demo_imp[,1:5])
# caret's preProcess() offers median imputation as well
imp<-preProcess(sim.dat,method="medianImpute")
demo_imp2<-predict(imp,sim.dat)
summary(demo_imp2[,1:5])
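Median imputation is simple enough to sketch directly in base R. The helper below, `impute_median`, is a hypothetical name (not a function from either package) showing what `medianImpute` does to one numeric vector:

```r
# A minimal sketch of median imputation for a single numeric vector.
# impute_median() is a hypothetical helper, not part of caret or
# imputeMissings.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

impute_median(c(1, 2, NA, 4))  # the NA becomes the median, 2
```

`preProcess(..., method="medianImpute")` applies the same idea column by column to every numeric predictor.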

Missing Values: K-nearest neighbors

imp<-preProcess(sim.dat,method="knnImpute",k=5)
# need to use predict() to get the KNN result;
# this step fails when sim.dat still contains factor columns
demo_imp<-predict(imp,sim.dat)

Solve the problem

imp<-preProcess(sim.dat,method="knnImpute",k=5)
# find the factor columns
idx<-which(lapply(sim.dat,class)=="factor")
# drop the factor columns before calling predict()
demo_imp<-predict(imp,sim.dat[,-idx])
summary(demo_imp[,1:3])

Missing Values: Bagging Tree

imp<-preProcess(sim.dat,method="bagImpute")
demo_imp<-predict(imp,sim.dat)
summary(demo_imp[,1:5])

Centering and Scaling

income<-sim.dat$income
# calculate the mean of income
mux<-mean(income,na.rm=T)
# calculate the standard deviation of income
sdx<-sd(income,na.rm=T)
# centering
tr1<-income-mux
# scaling
tr2<-tr1/sdx
sdat<-subset(sim.dat,select=c("age","income"))
# set the "method" option
trans<-preProcess(sdat,method=c("center","scale"))
# use predict() function to get the final result
transformed<-predict(trans,sdat)
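Both the manual steps and `preProcess` compute the same standardization, \((x-\bar{x})/s\); base R's `scale()` is a quick way to cross-check the result:

```r
# Centering and scaling with base R's scale(); the result matches
# the manual (x - mean) / sd computation above.
x <- c(50, 100, 150, 200)
manual <- (x - mean(x)) / sd(x)
auto <- as.numeric(scale(x))
all.equal(manual, auto)  # TRUE
```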

Resolve Skewness

# select the two columns and save them as dat_bc
dat_bc<-subset(sim.dat,select=c("store_trans","online_trans"))
(trans<-preProcess(dat_bc,method=c("BoxCox")))

Use predict() to get the transformed result:

transformed<-predict(trans,dat_bc)
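Under the hood, the Box-Cox transformation with parameter \(\lambda\) is \((y^{\lambda}-1)/\lambda\), or \(\log y\) when \(\lambda = 0\). A minimal sketch, where `box_cox` is a hypothetical helper (caret estimates \(\lambda\) for you):

```r
# Box-Cox transform for a given lambda (y must be positive).
# box_cox() is a hypothetical helper; preProcess() chooses lambda
# by maximum likelihood.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

box_cox(4, 0.5)  # (sqrt(4) - 1) / 0.5 = 2
```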

Resolve Outliers

Z-score and modified Z-score

\[Z_{i}=\frac{Y_{i}-\bar{Y}}{s}\] where \(\bar{Y}\) and \(s\) are the mean and standard deviation of \(Y\)

\[M_{i}=\frac{0.6745(Y_{i}-\tilde{Y})}{MAD}\]

where \(\tilde{Y}\) is the median of \(Y\) and MAD is the median of the series \(|Y_{i} - \tilde{Y}|\), known as the median absolute deviation
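Both statistics are easy to compute directly. The sketch below uses hypothetical helper names; the modified Z-score, as usually defined, is built from the median and the median absolute deviation, which makes it robust to the very outliers it is meant to detect. A common rule of thumb flags \(|M_i| > 3.5\) as an outlier.

```r
# Z-score and modified Z-score; z_score() and mod_z_score() are
# hypothetical helper names following the formulas above.
z_score <- function(y) (y - mean(y)) / sd(y)

mod_z_score <- function(y) {
  med <- median(y)
  mad_y <- median(abs(y - med))  # median absolute deviation
  0.6745 * (y - med) / mad_y
}

y <- c(1, 2, 3, 4, 100)
mod_z_score(y)  # the last value scores far above the 3.5 cutoff
```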

Collinearity

  1. Calculate the correlation matrix of the predictors.
  2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
  3. Determine the average correlation between A and the other variables. Do the same for predictor B.
  4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
  5. Repeat Steps 2-4 until no absolute pairwise correlations are above the threshold.
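The steps above can be sketched as a short base-R function; `remove_collinear` is a hypothetical name (caret's `findCorrelation()` implements the same greedy idea):

```r
# Greedy collinearity filter, a sketch of the five steps above.
# remove_collinear() is a hypothetical helper, not a caret function.
remove_collinear <- function(x, threshold = 0.9) {
  keep <- colnames(x)
  repeat {
    cm <- abs(cor(x[, keep, drop = FALSE]))
    diag(cm) <- 0
    if (max(cm) <= threshold) break
    # the pair with the largest absolute pairwise correlation
    ij <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- keep[ij[1]]
    b <- keep[ij[2]]
    # remove whichever has the larger average correlation to the rest
    drop <- if (mean(cm[a, ]) >= mean(cm[b, ])) a else b
    keep <- setdiff(keep, drop)
    if (length(keep) < 2) break
  }
  keep
}
```

Given two nearly duplicated predictors, one of the pair is dropped while unrelated predictors survive.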

Sparse Variables

  1. The fraction of unique values over the sample size
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value.
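These two diagnostics can be computed in a few lines of base R; `nzv_stats` is a hypothetical helper (caret's `nearZeroVar()` reports the same quantities, with the unique-value fraction expressed as a percentage):

```r
# Frequency ratio and unique-value fraction for one variable;
# near-zero-variance variables have a large freqRatio and a tiny
# percentUnique. nzv_stats() is a hypothetical helper.
nzv_stats <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) as.numeric(tab[1] / tab[2]) else Inf
  pct_unique <- length(unique(x)) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

nzv_stats(c(rep(0, 98), 1, 2))  # freqRatio 98, percentUnique 0.03
```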

Re-encode Dummy Variables

# class.ind() is in the nnet package
library(nnet)
dumVar<-class.ind(sim.dat$gender)
head(dumVar)
# dummyVars() is in the caret package
library(caret)
dumMod<-dummyVars(~gender+house+income,
                  data=sim.dat,
                  # use "original variable name + level" as the new name
                  levelsOnly=F)
head(predict(dumMod,sim.dat))
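Base R can produce the same one-hot columns without extra packages: `model.matrix()` with `- 1` drops the intercept so every factor level gets its own column.

```r
# One-hot encoding with base R's model.matrix(); "- 1" removes the
# intercept so each level of the factor becomes its own 0/1 column.
g <- factor(c("Female", "Male", "Female"))
model.matrix(~ g - 1)
# each row has a single 1, in the column for its level
```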