The superml R package is designed to unify the model training process in R, much as scikit-learn does in Python. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. superml provides a scikit-learn-like interface (`fit`, `predict`) to train models faster.
In addition to building machine learning models, it offers handy functionalities for feature engineering.
This ambitious package is my ongoing effort to help the R community build ML models more easily and quickly.
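Every trainer in superml follows the same scikit-learn-style pattern. A sketch (here `RFTrainer` is one of the trainers used later in this tutorial; `train_data`, `test_data`, and `"target"` are placeholder names):

```r
library(superml)

# 1. instantiate a trainer with its hyperparameters
model <- RFTrainer$new(n_estimators = 100)

# 2. fit on a data.frame / data.table, naming the target column
model$fit(X = train_data, y = "target")

# 3. predict on unseen data
preds <- model$predict(df = test_data)
```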
superml uses existing R packages to build machine learning models. In this tutorial, we'll use the data.table package for all data manipulation tasks.
Let's quickly prepare the data set for model training.
load("../data/reg_train.rda")
library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(superml)
library(kableExtra)
library(Metrics)
kable(head(reg_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])
xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]
# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]
for(c in cat_cols){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
# remove noisy columns
noise <- c('GrLivArea','TotalBsmtSF')
xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]
# fill missing value with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
# Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#>
#> Call:
#> glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -288636 -14944 -1300 13352 252611
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.305e+05 1.556e+06 0.534 0.593773
#> MSSubClass -7.685e+01 4.506e+01 -1.705 0.088463 .
#> MSZoning -1.965e+03 1.455e+03 -1.350 0.177316
#> LotFrontage -2.483e+01 3.215e+01 -0.772 0.440109
#> LotArea 5.571e-01 1.208e-01 4.613 4.52e-06 ***
#> Street -3.618e+04 1.737e+04 -2.083 0.037553 *
#> LotShape -1.312e+03 2.006e+03 -0.654 0.513282
#> LandContour 3.203e+03 1.813e+03 1.767 0.077616 .
#> Utilities NA NA NA NA
#> LotConfig 3.668e+03 1.627e+03 2.254 0.024404 *
#> LandSlope 1.874e+03 4.889e+03 0.383 0.701607
#> Neighborhood -5.940e+02 1.982e+02 -2.997 0.002795 **
#> Condition1 -6.606e+02 9.701e+02 -0.681 0.496094
#> Condition2 -1.665e+04 3.335e+03 -4.993 7.07e-07 ***
#> BldgType -2.243e+03 1.938e+03 -1.157 0.247388
#> HouseStyle 9.602e+01 8.915e+02 0.108 0.914252
#> OverallQual 1.441e+04 1.360e+03 10.590 < 2e-16 ***
#> OverallCond 5.392e+03 1.257e+03 4.290 1.97e-05 ***
#> YearBuilt 4.012e+02 8.088e+01 4.960 8.34e-07 ***
#> YearRemodAdd 4.747e+01 7.764e+01 0.611 0.541078
#> RoofStyle 4.066e+03 2.081e+03 1.953 0.051061 .
#> RoofMatl -1.342e+04 2.888e+03 -4.647 3.85e-06 ***
#> Exterior1st 1.374e+02 5.457e+02 0.252 0.801210
#> Exterior2nd 2.492e+02 5.700e+02 0.437 0.662112
#> MasVnrType -3.454e+03 1.761e+03 -1.961 0.050163 .
#> MasVnrArea 3.152e+01 7.053e+00 4.469 8.82e-06 ***
#> ExterQual 6.782e+03 2.537e+03 2.674 0.007633 **
#> ExterCond -4.776e+02 1.768e+03 -0.270 0.787088
#> Foundation 8.243e+01 1.588e+03 0.052 0.958610
#> BsmtQual 6.282e+03 1.438e+03 4.367 1.40e-05 ***
#> BsmtCond -1.182e+03 1.894e+03 -0.624 0.532944
#> BsmtExposure -2.816e+03 1.360e+03 -2.071 0.038617 *
#> BsmtFinType1 -6.169e+02 7.477e+02 -0.825 0.409586
#> BsmtFinSF1 9.267e+00 5.833e+00 1.589 0.112459
#> BsmtFinType2 2.044e+02 1.377e+03 0.148 0.882039
#> BsmtFinSF2 6.418e+00 1.022e+01 0.628 0.530206
#> BsmtUnfSF 7.551e+00 5.488e+00 1.376 0.169119
#> Heating -7.220e+02 3.273e+03 -0.221 0.825437
#> HeatingQC 1.909e+02 1.402e+03 0.136 0.891732
#> CentralAir -7.521e+02 5.456e+03 -0.138 0.890388
#> Electrical 2.218e+03 2.111e+03 1.050 0.293852
#> `1stFlrSF` 4.218e+01 6.934e+00 6.083 1.71e-09 ***
#> `2ndFlrSF` 3.693e+01 5.659e+00 6.526 1.10e-10 ***
#> LowQualFinSF 1.872e+01 2.529e+01 0.740 0.459429
#> BsmtFullBath 1.219e+04 2.837e+03 4.297 1.91e-05 ***
#> BsmtHalfBath 5.407e+03 4.549e+03 1.188 0.234946
#> FullBath 8.630e+03 3.120e+03 2.766 0.005788 **
#> HalfBath 1.100e+03 2.905e+03 0.378 0.705166
#> BedroomAbvGr -7.437e+03 1.964e+03 -3.786 0.000163 ***
#> KitchenAbvGr -2.149e+04 6.030e+03 -3.564 0.000384 ***
#> KitchenQual 4.969e+03 1.715e+03 2.898 0.003845 **
#> TotRmsAbvGrd 4.908e+03 1.402e+03 3.500 0.000487 ***
#> Functional -3.120e+03 1.514e+03 -2.061 0.039537 *
#> Fireplaces 6.630e+03 2.147e+03 3.087 0.002078 **
#> FireplaceQu 4.640e+03 1.220e+03 3.804 0.000152 ***
#> GarageType 7.575e+02 1.243e+03 0.609 0.542483
#> GarageYrBlt -5.159e+00 5.424e+00 -0.951 0.341796
#> GarageFinish 2.816e+03 1.432e+03 1.966 0.049555 *
#> GarageCars 1.347e+04 3.378e+03 3.987 7.22e-05 ***
#> GarageArea 5.489e-01 1.118e+01 0.049 0.960857
#> GarageQual 2.894e+03 3.713e+03 0.779 0.435902
#> GarageCond -4.153e+03 3.129e+03 -1.327 0.184792
#> PavedDrive -2.570e+03 2.997e+03 -0.858 0.391238
#> WoodDeckSF 3.099e+01 8.823e+00 3.513 0.000465 ***
#> OpenPorchSF 9.344e+00 1.633e+01 0.572 0.567233
#> EnclosedPorch 3.353e+01 1.914e+01 1.752 0.080074 .
#> `3SsnPorch` 1.930e+01 3.611e+01 0.534 0.593164
#> ScreenPorch 6.104e+01 1.886e+01 3.237 0.001250 **
#> PoolArea -2.239e+02 3.403e+01 -6.578 7.84e-11 ***
#> Fence -9.298e+02 1.067e+03 -0.871 0.383843
#> MiscVal 2.310e+00 1.815e+00 1.273 0.203280
#> MoSold 7.303e+01 3.731e+02 0.196 0.844860
#> YrSold -8.810e+02 7.729e+02 -1.140 0.254620
#> SaleType 2.076e+03 1.165e+03 1.783 0.074974 .
#> SaleCondition 8.395e+02 1.233e+03 0.681 0.496242
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 947672939)
#>
#> Null deviance: 6.0324e+12 on 1023 degrees of freedom
#> Residual deviance: 9.0029e+11 on 950 degrees of freedom
#> AIC: 24145
#>
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
#> ifelse(type == : prediction from a rank-deficient fit may be misleading
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39576.66
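For reference, `Metrics::rmse` simply computes the root of the mean squared error; a minimal base-R equivalent:

```r
# root mean squared error, equivalent to Metrics::rmse
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse_manual(actual = c(100, 200, 300), predicted = c(110, 190, 310))
#> [1] 10
```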
# Lasso Regression (alpha = 1 applies the L1 penalty)
lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 46457.92
# Ridge Regression (alpha = 0 applies the L2 penalty)
lf <- LMTrainer$new(family = "gaussian", alpha = 0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 45590.46
# 5-fold cross-validated regression
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 48574.82
# Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> OverallQual 762388648492
#> GarageCars 443917936129
#> GarageArea 438415022787
#> ExterQual 392634180169
#> 1stFlrSF 389111394288
#> YearBuilt 348061020505
#> FullBath 244921917678
#> GarageYrBlt 224987891963
#> KitchenQual 202316372983
#> BsmtFinSF1 196310696874
#> 2ndFlrSF 191806007761
#> TotRmsAbvGrd 182146601008
#> LotArea 181829996190
#> MasVnrArea 160671536767
#> YearRemodAdd 148015528521
#> Fireplaces 146619837937
#> BsmtQual 100594920086
#> FireplaceQu 97770197692
#> OpenPorchSF 82466007572
#> WoodDeckSF 75577528149
#> LotFrontage 68552506794
#> BsmtUnfSF 64260018301
#> Neighborhood 57704642543
#> HeatingQC 50929867504
#> GarageType 48658524141
#> HalfBath 40521815292
#> MSSubClass 40332272055
#> BedroomAbvGr 38902491856
#> OverallCond 31673065422
#> MoSold 31464135886
#> MasVnrType 30820487021
#> BsmtExposure 29659601614
#> Exterior2nd 29539375055
#> GarageFinish 29222993884
#> RoofStyle 27980738038
#> Exterior1st 26278475900
#> BsmtFinType1 24123405529
#> BsmtFullBath 21543238948
#> SaleType 18423272848
#> Foundation 18151392427
#> LotShape 18060000288
#> HouseStyle 18033925056
#> YrSold 17912302468
#> SaleCondition 17908048161
#> MSZoning 17088495795
#> LandContour 16243240135
#> BsmtHalfBath 14395629154
#> GarageCond 13602826881
#> LotConfig 13594582138
#> CentralAir 13499582766
#> EnclosedPorch 13312459808
#> RoofMatl 12796297978
#> LandSlope 11726075334
#> BldgType 11600814598
#> Condition1 9180518813
#> GarageQual 8596843126
#> Functional 8551067393
#> ScreenPorch 7930191094
#> BsmtCond 6804432449
#> BsmtFinSF2 6309947600
#> KitchenAbvGr 6210375729
#> PavedDrive 6193565641
#> BsmtFinType2 5719448464
#> ExterCond 4510600338
#> Fence 4435303313
#> Electrical 3425000994
#> Condition2 3218685335
#> Heating 3022339248
#> MiscVal 1994122437
#> 3SsnPorch 1602733376
#> PoolArea 937944638
#> Street 622907299
#> LowQualFinSF 527030467
#> Utilities 0
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34945.84
# XGBoost
xgb <- XGBTrainer$new(objective = "reg:linear"
, n_estimators = 500
, eval_metric = "rmse"
, maximize = F
, learning_rate = 0.1
, max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-rmse:177067.359375 val-rmse:183883.125000
#> [51] train-rmse:8336.882812 val-rmse:36358.332031
#> [101] train-rmse:5049.429688 val-rmse:35171.792969
#> [151] train-rmse:3243.850830 val-rmse:34962.609375
#> [201] train-rmse:2092.932129 val-rmse:34885.316406
#> [251] train-rmse:1302.330566 val-rmse:34868.789062
#> [301] train-rmse:899.406860 val-rmse:34878.847656
#> [351] train-rmse:633.702637 val-rmse:34872.972656
#> [401] train-rmse:442.377075 val-rmse:34873.195312
#> [451] train-rmse:308.838348 val-rmse:34868.191406
#> [500] train-rmse:216.254410 val-rmse:34868.695312
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34868.69
Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package to solve classification problems.
Data Preparation
# load class
load('../data/cla_train.rda')
kable(head(cla_train, 10)) %>%
scroll_box(width = "100%", height = "300px")
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q | |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | S | |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | S | |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | C |
# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]
# encode categorical variables - longer way
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain$Embarked, xtest$Embarked))
#> The data contains blank values. Imputing them with 'NA'
xtrain[, Embarked := lbl$transform(Embarked)]
xtest[, Embarked := lbl$transform(Embarked)]
#> The data contains blank values. Imputing them with 'NA'
lbl1 <- LabelEncoder$new()
lbl1$fit(c(xtrain$Sex, xtest$Sex))
xtrain[, Sex := lbl1$transform(Sex)]
xtest[, Sex := lbl1$transform(Sex)]
lbl2 <- LabelEncoder$new()
lbl2$fit(c(xtrain$Cabin, xtest$Cabin))
#> The data contains blank values. Imputing them with 'NA'
xtrain[, Cabin := lbl2$transform(Cabin)]
#> The data contains blank values. Imputing them with 'NA'
xtest[, Cabin := lbl2$transform(Cabin)]
#> The data contains blank values. Imputing them with 'NA'
# encode categorical variables - shorter way
# for(c in c('Embarked','Sex','Cabin')){
# lbl <- LabelEncoder$new()
# lbl$fit(c(xtrain[[c]], xtest[[c]]))
# xtrain[[c]] <- lbl$transform(xtrain[[c]])
# xtest[[c]] <- lbl$transform(xtest[[c]])
# }
# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
# drop these features
to_drop <- c('PassengerId','Ticket','Name')
xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]
Now our data is ready for model training. Let's do it.
Logistic Regression
lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#>
#> Call:
#> glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0412 -0.5675 -0.4147 0.6405 2.4965
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.422624 0.642783 2.213 0.026882 *
#> Pclass -0.810483 0.189792 -4.270 1.95e-05 ***
#> Sex 2.699019 0.238631 11.310 < 2e-16 ***
#> Age -0.046053 0.009754 -4.721 2.34e-06 ***
#> SibSp -0.324124 0.131146 -2.471 0.013456 *
#> Parch -0.132431 0.145039 -0.913 0.361205
#> Fare 0.001276 0.003220 0.396 0.691900
#> Cabin 0.018142 0.005505 3.296 0.000982 ***
#> Embarked 0.243929 0.142191 1.716 0.086253 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 814.73 on 623 degrees of freedom
#> Residual deviance: 537.44 on 615 degrees of freedom
#> AIC: 555.44
#>
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8498749
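`Metrics::auc` accepts raw predicted probabilities because AUC is rank-based: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counted as one half). A minimal base-R sketch of that pairwise definition:

```r
# AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
# pairs where the positive example receives the higher score
auc_manual <- function(actual, predicted) {
  pos <- predicted[actual == 1]
  neg <- predicted[actual == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_manual(actual = c(1, 1, 0, 0), predicted = c(0.9, 0.4, 0.6, 0.2))
#> [1] 0.75
```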
Lasso Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8607383
Ridge Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8572972
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 69.497749
#> Fare 48.744329
#> Age 46.546454
#> Cabin 26.147793
#> Pclass 20.569323
#> SibSp 11.960032
#> Embarked 9.807184
#> Parch 8.916833
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8177682
XGBoost
xgb <- XGBTrainer$new(objective = "binary:logistic"
, n_estimators = 500
, eval_metric = "auc"
, maximize = T
, learning_rate = 0.1
, max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1] train-auc:0.871518 val-auc:0.862416
#> [51] train-auc:0.969107 val-auc:0.882266
#> [101] train-auc:0.984336 val-auc:0.875811
#> [151] train-auc:0.990848 val-auc:0.872000
#> [201] train-auc:0.993482 val-auc:0.873820
#> [251] train-auc:0.995670 val-auc:0.873479
#> [301] train-auc:0.997154 val-auc:0.872455
#> [351] train-auc:0.997790 val-auc:0.871772
#> [401] train-auc:0.998214 val-auc:0.870208
#> [451] train-auc:0.998471 val-auc:0.869042
#> [500] train-auc:0.998605 val-auc:0.868644
pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8686441
Let’s create a new feature from the target variable using target encoding and test a model.
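Target (mean) encoding replaces each level of a categorical column with the mean of the target for that level, shrunk toward the global target mean so that rare levels don't get noisy estimates. A base-R illustration of the idea (the exact shrinkage weighting used by superml's `smoothMean` may differ; `m` here is a hypothetical smoothing weight):

```r
# smoothed level mean: (n * level_mean + m * global_mean) / (n + m)
smooth_target_mean <- function(target, group, m = 10) {
  global_mean <- mean(target)
  enc <- tapply(target, group, function(y) {
    n <- length(y)
    (n * mean(y) + m * global_mean) / (n + m)
  })
  unname(enc[as.character(group)])
}

# tiny example shaped like Survived ~ Embarked
smooth_target_mean(c(1, 0, 1, 1, 0, 0), c("S", "S", "C", "C", "Q", "S"), m = 2)
#> [1] 0.4000000 0.4000000 0.7500000 0.7500000 0.3333333 0.4000000
```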
# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$test[[2]]]
# train a random forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 71.025202
#> Fare 51.960577
#> Age 51.357592
#> Cabin 25.915233
#> Pclass 22.019871
#> SibSp 12.146112
#> Parch 8.873968
#> Embarked 6.101673
#> feat_01 5.737802
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8202423