Data Pre-processing

Hui Lin

2019-02-12

Data can be dirty and usually is!

Data types

  1. Raw data
  2. Technically correct data
  3. Data that is proper for the model
  4. Summarized data
  5. Data with fixed format

Preprocessing Map

Data Cleaning

sim.dat <- read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
summary(sim.dat)

Data Cleaning

# set problematic values as missing
sim.dat$age[which(sim.dat$age>100)]<-NA
sim.dat$store_exp[which(sim.dat$store_exp<0)]<-NA
# see the results
summary(subset(sim.dat,select=c("age","store_exp")))

Missing Values

Missing Values: median/mode

# load the packages that provide the imputation functions
library(imputeMissings)
library(caret)
# save the result as another object
demo_imp<-impute(sim.dat,method="median/mode")
# check the first 5 columns; there are no missing values in the other columns
summary(demo_imp[,1:5])
# caret's preProcess() offers median imputation as well
imp<-preProcess(sim.dat,method="medianImpute")
demo_imp2<-predict(imp,sim.dat)
summary(demo_imp2[,1:5])
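Median imputation is simple enough to sketch directly in base R. The helper below, `impute_median`, is a hypothetical name (not a function from either package) showing what `medianImpute` does to one numeric vector:

```r
# A minimal sketch of median imputation for a single numeric vector.
# impute_median() is a hypothetical helper, not part of caret or
# imputeMissings.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

impute_median(c(1, 2, NA, 4))  # the NA becomes the median, 2
```

`preProcess(..., method="medianImpute")` applies the same idea column by column to every numeric predictor.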

Missing Values: K-nearest neighbors

imp<-preProcess(sim.dat,method="knnImpute",k=5)
# need to use predict() to get the KNN result;
# this step fails when sim.dat still contains factor columns
demo_imp<-predict(imp,sim.dat)

Solve the problem

imp<-preProcess(sim.dat,method="knnImpute",k=5)
# find the factor columns
idx<-which(lapply(sim.dat,class)=="factor")
# drop the factor columns before calling predict()
demo_imp<-predict(imp,sim.dat[,-idx])
summary(demo_imp[,1:3])

Missing Values: Bagging Tree

imp<-preProcess(sim.dat,method="bagImpute")
demo_imp<-predict(imp,sim.dat)
summary(demo_imp[,1:5])

Centering and Scaling

income<-sim.dat$income
# calculate the mean of income
mux<-mean(income,na.rm=T)
# calculate the standard deviation of income
sdx<-sd(income,na.rm=T)
# centering
tr1<-income-mux
# scaling
tr2<-tr1/sdx
sdat<-subset(sim.dat,select=c("age","income"))
# set the "method" option
trans<-preProcess(sdat,method=c("center","scale"))
# use predict() function to get the final result
transformed<-predict(trans,sdat)
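Both the manual steps and `preProcess` compute the same standardization, \((x-\bar{x})/s\); base R's `scale()` is a quick way to cross-check the result:

```r
# Centering and scaling with base R's scale(); the result matches
# the manual (x - mean) / sd computation above.
x <- c(50, 100, 150, 200)
manual <- (x - mean(x)) / sd(x)
auto <- as.numeric(scale(x))
all.equal(manual, auto)  # TRUE
```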

Resolve Skewness

# select the two columns and save them as dat_bc
dat_bc<-subset(sim.dat,select=c("store_trans","online_trans"))
(trans<-preProcess(dat_bc,method=c("BoxCox")))

Use predict() to get the transformed result:

transformed<-predict(trans,dat_bc)
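Under the hood, the Box-Cox transformation with parameter \(\lambda\) is \((y^{\lambda}-1)/\lambda\), or \(\log y\) when \(\lambda = 0\). A minimal sketch, where `box_cox` is a hypothetical helper (caret estimates \(\lambda\) for you):

```r
# Box-Cox transform for a given lambda (y must be positive).
# box_cox() is a hypothetical helper; preProcess() chooses lambda
# by maximum likelihood.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

box_cox(4, 0.5)  # (sqrt(4) - 1) / 0.5 = 2
```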

Resolve Outliers

Z-score and modified Z-score

\[Z_{i}=\frac{Y_{i}-\bar{Y}}{s}\] where \(\bar{Y}\) and \(s\) are the mean and standard deviation of \(Y\)

\[M_{i}=\frac{0.6745(Y_{i}-\tilde{Y})}{MAD}\]

where \(\tilde{Y}\) is the median of \(Y\) and MAD is the median of the series \(|Y_{i} - \tilde{Y}|\), known as the median absolute deviation
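Both statistics are easy to compute directly. The sketch below uses hypothetical helper names; the modified Z-score, as usually defined, is built from the median and the median absolute deviation, which makes it robust to the very outliers it is meant to detect. A common rule of thumb flags \(|M_i| > 3.5\) as an outlier.

```r
# Z-score and modified Z-score; z_score() and mod_z_score() are
# hypothetical helper names following the formulas above.
z_score <- function(y) (y - mean(y)) / sd(y)

mod_z_score <- function(y) {
  med <- median(y)
  mad_y <- median(abs(y - med))  # median absolute deviation
  0.6745 * (y - med) / mad_y
}

y <- c(1, 2, 3, 4, 100)
mod_z_score(y)  # the last value scores far above the 3.5 cutoff
```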

Collinearity

  1. Calculate the correlation matrix of the predictors.
  2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B).
  3. Determine the average correlation between A and the other variables. Do the same for predictor B.
  4. If A has a larger average correlation, remove it; otherwise, remove predictor B.
  5. Repeat Steps 2-4 until no absolute pairwise correlations are above the threshold.
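The steps above can be sketched as a short base-R function; `remove_collinear` is a hypothetical name (caret's `findCorrelation()` implements the same greedy idea):

```r
# Greedy collinearity filter, a sketch of the five steps above.
# remove_collinear() is a hypothetical helper, not a caret function.
remove_collinear <- function(x, threshold = 0.9) {
  keep <- colnames(x)
  repeat {
    cm <- abs(cor(x[, keep, drop = FALSE]))
    diag(cm) <- 0
    if (max(cm) <= threshold) break
    # the pair with the largest absolute pairwise correlation
    ij <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    a <- keep[ij[1]]
    b <- keep[ij[2]]
    # remove whichever has the larger average correlation to the rest
    drop <- if (mean(cm[a, ]) >= mean(cm[b, ])) a else b
    keep <- setdiff(keep, drop)
    if (length(keep) < 2) break
  }
  keep
}
```

Given two nearly duplicated predictors, one of the pair is dropped while unrelated predictors survive.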

Sparse Variables

  1. The fraction of unique values over the sample size
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value.
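These two diagnostics can be computed in a few lines of base R; `nzv_stats` is a hypothetical helper (caret's `nearZeroVar()` reports the same quantities, with the unique-value fraction expressed as a percentage):

```r
# Frequency ratio and unique-value fraction for one variable;
# near-zero-variance variables have a large freqRatio and a tiny
# percentUnique. nzv_stats() is a hypothetical helper.
nzv_stats <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) as.numeric(tab[1] / tab[2]) else Inf
  pct_unique <- length(unique(x)) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

nzv_stats(c(rep(0, 98), 1, 2))  # freqRatio 98, percentUnique 0.03
```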

Re-encode Dummy Variables

# class.ind() is in the nnet package
library(nnet)
dumVar<-class.ind(sim.dat$gender)
head(dumVar)
# dummyVars() is in the caret package
library(caret)
dumMod<-dummyVars(~gender+house+income,
                  data=sim.dat,
                  # use "original variable name + level" as the new name
                  levelsOnly=F)
head(predict(dumMod,sim.dat))
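Base R can produce the same one-hot columns without extra packages: `model.matrix()` with `- 1` drops the intercept so every factor level gets its own column.

```r
# One-hot encoding with base R's model.matrix(); "- 1" removes the
# intercept so each level of the factor becomes its own 0/1 column.
g <- factor(c("Female", "Male", "Female"))
model.matrix(~ g - 1)
# each row has a single 1, in the column for its level
```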