The SuperML R package is designed to unify the model training process in R, much like in Python. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-style interface (fit, predict) to train models faster.

In addition to building machine learning models, there are handy functionalities for feature engineering.

This ambitious package is my ongoing effort to help the R community build ML models more easily and quickly in R.
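Every trainer in superml follows the same fit/predict shape described above. A minimal sketch of that pattern (here `train_data`, `test_data` and the target column `"target"` are placeholders, not objects defined in this document):

```r
library(superml)

# every trainer class is instantiated, fit, and used for prediction the same way
rf <- RFTrainer$new(n_estimators = 100, classification = 1)
rf$fit(X = train_data, y = "target")   # train_data is a data.frame/data.table containing "target"
pred <- rf$predict(df = test_data)     # predictions on unseen data
```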

Install

This package is not yet on CRAN. Currently, you can install it using:

devtools::install_github("saraswatmks/superml")
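Once the package is accepted on CRAN, the standard installation route should presumably work as well:

```r
install.packages("superml")
```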

Examples - Machine Learning Models

This package uses existing R packages to build machine learning models. In this tutorial, we’ll use the data.table package for all data manipulation tasks.

Regression Data

We’ll quickly prepare the data set for model training.

load("../data/reg_train.rda")

library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.4.4
library(superml)
library(kableExtra)
#> Warning: package 'kableExtra' was built under R version 3.4.4
library(Metrics)

kable(head(reg_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
(Output: a wide scrollable preview table showing the first 10 rows of the training data, with 81 columns ranging from Id and MSSubClass through the target variable SalePrice.)

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with more than 90% missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
#> ... (message repeated for each encoded column)

# remove noise columns
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
# Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -288636   -14944    -1300    13352   252611  
#> 
#> Coefficients: (1 not defined because of singularities)
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    8.305e+05  1.556e+06   0.534 0.593773    
#> MSSubClass    -7.685e+01  4.506e+01  -1.705 0.088463 .  
#> MSZoning      -1.965e+03  1.455e+03  -1.350 0.177316    
#> LotFrontage   -2.483e+01  3.215e+01  -0.772 0.440109    
#> LotArea        5.571e-01  1.208e-01   4.613 4.52e-06 ***
#> Street        -3.618e+04  1.737e+04  -2.083 0.037553 *  
#> LotShape      -1.312e+03  2.006e+03  -0.654 0.513282    
#> LandContour    3.203e+03  1.813e+03   1.767 0.077616 .  
#> Utilities             NA         NA      NA       NA    
#> LotConfig      3.668e+03  1.627e+03   2.254 0.024404 *  
#> LandSlope      1.874e+03  4.889e+03   0.383 0.701607    
#> Neighborhood  -5.940e+02  1.982e+02  -2.997 0.002795 ** 
#> Condition1    -6.606e+02  9.701e+02  -0.681 0.496094    
#> Condition2    -1.665e+04  3.335e+03  -4.993 7.07e-07 ***
#> BldgType      -2.243e+03  1.938e+03  -1.157 0.247388    
#> HouseStyle     9.602e+01  8.915e+02   0.108 0.914252    
#> OverallQual    1.441e+04  1.360e+03  10.590  < 2e-16 ***
#> OverallCond    5.392e+03  1.257e+03   4.290 1.97e-05 ***
#> YearBuilt      4.012e+02  8.088e+01   4.960 8.34e-07 ***
#> YearRemodAdd   4.747e+01  7.764e+01   0.611 0.541078    
#> RoofStyle      4.066e+03  2.081e+03   1.953 0.051061 .  
#> RoofMatl      -1.342e+04  2.888e+03  -4.647 3.85e-06 ***
#> Exterior1st    1.374e+02  5.457e+02   0.252 0.801210    
#> Exterior2nd    2.492e+02  5.700e+02   0.437 0.662112    
#> MasVnrType    -3.454e+03  1.761e+03  -1.961 0.050163 .  
#> MasVnrArea     3.152e+01  7.053e+00   4.469 8.82e-06 ***
#> ExterQual      6.782e+03  2.537e+03   2.674 0.007633 ** 
#> ExterCond     -4.776e+02  1.768e+03  -0.270 0.787088    
#> Foundation     8.243e+01  1.588e+03   0.052 0.958610    
#> BsmtQual       6.282e+03  1.438e+03   4.367 1.40e-05 ***
#> BsmtCond      -1.182e+03  1.894e+03  -0.624 0.532944    
#> BsmtExposure  -2.816e+03  1.360e+03  -2.071 0.038617 *  
#> BsmtFinType1  -6.169e+02  7.477e+02  -0.825 0.409586    
#> BsmtFinSF1     9.267e+00  5.833e+00   1.589 0.112459    
#> BsmtFinType2   2.044e+02  1.377e+03   0.148 0.882039    
#> BsmtFinSF2     6.418e+00  1.022e+01   0.628 0.530206    
#> BsmtUnfSF      7.551e+00  5.488e+00   1.376 0.169119    
#> Heating       -7.220e+02  3.273e+03  -0.221 0.825437    
#> HeatingQC      1.909e+02  1.402e+03   0.136 0.891732    
#> CentralAir    -7.521e+02  5.456e+03  -0.138 0.890388    
#> Electrical     2.218e+03  2.111e+03   1.050 0.293852    
#> `1stFlrSF`     4.218e+01  6.934e+00   6.083 1.71e-09 ***
#> `2ndFlrSF`     3.693e+01  5.659e+00   6.526 1.10e-10 ***
#> LowQualFinSF   1.872e+01  2.529e+01   0.740 0.459429    
#> BsmtFullBath   1.219e+04  2.837e+03   4.297 1.91e-05 ***
#> BsmtHalfBath   5.407e+03  4.549e+03   1.188 0.234946    
#> FullBath       8.630e+03  3.120e+03   2.766 0.005788 ** 
#> HalfBath       1.100e+03  2.905e+03   0.378 0.705166    
#> BedroomAbvGr  -7.437e+03  1.964e+03  -3.786 0.000163 ***
#> KitchenAbvGr  -2.149e+04  6.030e+03  -3.564 0.000384 ***
#> KitchenQual    4.969e+03  1.715e+03   2.898 0.003845 ** 
#> TotRmsAbvGrd   4.908e+03  1.402e+03   3.500 0.000487 ***
#> Functional    -3.120e+03  1.514e+03  -2.061 0.039537 *  
#> Fireplaces     6.630e+03  2.147e+03   3.087 0.002078 ** 
#> FireplaceQu    4.640e+03  1.220e+03   3.804 0.000152 ***
#> GarageType     7.575e+02  1.243e+03   0.609 0.542483    
#> GarageYrBlt   -5.159e+00  5.424e+00  -0.951 0.341796    
#> GarageFinish   2.816e+03  1.432e+03   1.966 0.049555 *  
#> GarageCars     1.347e+04  3.378e+03   3.987 7.22e-05 ***
#> GarageArea     5.489e-01  1.118e+01   0.049 0.960857    
#> GarageQual     2.894e+03  3.713e+03   0.779 0.435902    
#> GarageCond    -4.153e+03  3.129e+03  -1.327 0.184792    
#> PavedDrive    -2.570e+03  2.997e+03  -0.858 0.391238    
#> WoodDeckSF     3.099e+01  8.823e+00   3.513 0.000465 ***
#> OpenPorchSF    9.344e+00  1.633e+01   0.572 0.567233    
#> EnclosedPorch  3.353e+01  1.914e+01   1.752 0.080074 .  
#> `3SsnPorch`    1.930e+01  3.611e+01   0.534 0.593164    
#> ScreenPorch    6.104e+01  1.886e+01   3.237 0.001250 ** 
#> PoolArea      -2.239e+02  3.403e+01  -6.578 7.84e-11 ***
#> Fence         -9.298e+02  1.067e+03  -0.871 0.383843    
#> MiscVal        2.310e+00  1.815e+00   1.273 0.203280    
#> MoSold         7.303e+01  3.731e+02   0.196 0.844860    
#> YrSold        -8.810e+02  7.729e+02  -1.140 0.254620    
#> SaleType       2.076e+03  1.165e+03   1.783 0.074974 .  
#> SaleCondition  8.395e+02  1.233e+03   0.681 0.496242    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 947672939)
#> 
#>     Null deviance: 6.0324e+12  on 1023  degrees of freedom
#> Residual deviance: 9.0029e+11  on  950  degrees of freedom
#> AIC: 24145
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
#> Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
#> ifelse(type == : prediction from a rank-deficient fit may be misleading
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39576.66
# Lasso Regression
lf <- LMTrainer$new(family = "gaussian", alpha=1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 46457.92
# Ridge Regression
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 45590.46
# cross-validated regression
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 48574.82
# Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         762388648492
#> GarageCars                          443917936129
#> GarageArea                          438415022787
#> ExterQual                           392634180169
#> 1stFlrSF                            389111394288
#> YearBuilt                           348061020505
#> FullBath                            244921917678
#> GarageYrBlt                         224987891963
#> KitchenQual                         202316372983
#> BsmtFinSF1                          196310696874
#> 2ndFlrSF                            191806007761
#> TotRmsAbvGrd                        182146601008
#> LotArea                             181829996190
#> MasVnrArea                          160671536767
#> YearRemodAdd                        148015528521
#> Fireplaces                          146619837937
#> BsmtQual                            100594920086
#> FireplaceQu                          97770197692
#> OpenPorchSF                          82466007572
#> WoodDeckSF                           75577528149
#> LotFrontage                          68552506794
#> BsmtUnfSF                            64260018301
#> Neighborhood                         57704642543
#> HeatingQC                            50929867504
#> GarageType                           48658524141
#> HalfBath                             40521815292
#> MSSubClass                           40332272055
#> BedroomAbvGr                         38902491856
#> OverallCond                          31673065422
#> MoSold                               31464135886
#> MasVnrType                           30820487021
#> BsmtExposure                         29659601614
#> Exterior2nd                          29539375055
#> GarageFinish                         29222993884
#> RoofStyle                            27980738038
#> Exterior1st                          26278475900
#> BsmtFinType1                         24123405529
#> BsmtFullBath                         21543238948
#> SaleType                             18423272848
#> Foundation                           18151392427
#> LotShape                             18060000288
#> HouseStyle                           18033925056
#> YrSold                               17912302468
#> SaleCondition                        17908048161
#> MSZoning                             17088495795
#> LandContour                          16243240135
#> BsmtHalfBath                         14395629154
#> GarageCond                           13602826881
#> LotConfig                            13594582138
#> CentralAir                           13499582766
#> EnclosedPorch                        13312459808
#> RoofMatl                             12796297978
#> LandSlope                            11726075334
#> BldgType                             11600814598
#> Condition1                            9180518813
#> GarageQual                            8596843126
#> Functional                            8551067393
#> ScreenPorch                           7930191094
#> BsmtCond                              6804432449
#> BsmtFinSF2                            6309947600
#> KitchenAbvGr                          6210375729
#> PavedDrive                            6193565641
#> BsmtFinType2                          5719448464
#> ExterCond                             4510600338
#> Fence                                 4435303313
#> Electrical                            3425000994
#> Condition2                            3218685335
#> Heating                               3022339248
#> MiscVal                               1994122437
#> 3SsnPorch                             1602733376
#> PoolArea                               937944638
#> Street                                 622907299
#> LowQualFinSF                           527030467
#> Utilities                                      0
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34945.84
# XGBoost
xgb <- XGBTrainer$new(objective = "reg:linear",
                      n_estimators = 500,
                      eval_metric = "rmse",
                      maximize = FALSE,
                      learning_rate = 0.1,
                      max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:177067.359375    val-rmse:183883.125000 
#> [51] train-rmse:8336.882812  val-rmse:36358.332031 
#> [101]    train-rmse:5049.429688  val-rmse:35171.792969 
#> [151]    train-rmse:3243.850830  val-rmse:34962.609375 
#> [201]    train-rmse:2092.932129  val-rmse:34885.316406 
#> [251]    train-rmse:1302.330566  val-rmse:34868.789062 
#> [301]    train-rmse:899.406860   val-rmse:34878.847656 
#> [351]    train-rmse:633.702637   val-rmse:34872.972656 
#> [401]    train-rmse:442.377075   val-rmse:34873.195312 
#> [451]    train-rmse:308.838348   val-rmse:34868.191406 
#> [500]    train-rmse:216.254410   val-rmse:34868.695312
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34868.69

Binary Classification Data

Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package for classification tasks.

Data Preparation

# load the classification data
load('../data/cla_train.rda')

kable(head(cla_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 C

# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - longer way
lbl <- LabelEncoder$new()

lbl$fit(c(xtrain$Embarked, xtest$Embarked))
#> The data contains blank values. Imputing them with 'NA'
xtrain[, Embarked := lbl$transform(Embarked)]
xtest[, Embarked := lbl$transform(Embarked)]
#> The data contains blank values. Imputing them with 'NA'

lbl1 <- LabelEncoder$new()

lbl1$fit(c(xtrain$Sex, xtest$Sex))
xtrain[, Sex := lbl1$transform(Sex)]
xtest[, Sex := lbl1$transform(Sex)]

lbl2 <- LabelEncoder$new()

lbl2$fit(c(xtrain$Cabin, xtest$Cabin))
#> The data contains blank values. Imputing them with 'NA'
xtrain[, Cabin := lbl2$transform(Cabin)]
#> The data contains blank values. Imputing them with 'NA'
xtest[, Cabin := lbl2$transform(Cabin)]
#> The data contains blank values. Imputing them with 'NA'

# encode categorical variables - shorter way
# for(c in c('Embarked','Sex','Cabin')){
#     lbl <- LabelEncoder$new()
#     lbl$fit(c(xtrain[[c]], xtest[[c]]))
#     xtrain[[c]] <- lbl$transform(xtrain[[c]])
#     xtest[[c]] <- lbl$transform(xtest[[c]])
# }

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now our data is ready for model training. Let’s do it.

Logistic Regression

lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#> 
#> Call:
#> glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0412  -0.5675  -0.4147   0.6405   2.4965  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.422624   0.642783   2.213 0.026882 *  
#> Pclass      -0.810483   0.189792  -4.270 1.95e-05 ***
#> Sex          2.699019   0.238631  11.310  < 2e-16 ***
#> Age         -0.046053   0.009754  -4.721 2.34e-06 ***
#> SibSp       -0.324124   0.131146  -2.471 0.013456 *  
#> Parch       -0.132431   0.145039  -0.913 0.361205    
#> Fare         0.001276   0.003220   0.396 0.691900    
#> Cabin        0.018142   0.005505   3.296 0.000982 ***
#> Embarked     0.243929   0.142191   1.716 0.086253 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 814.73  on 623  degrees of freedom
#> Residual deviance: 537.44  on 615  degrees of freedom
#> AIC: 555.44
#> 
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8498749

Lasso Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8607383

Ridge Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8572972

Random Forest

rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")

pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               69.497749
#> Fare                              48.744329
#> Age                               46.546454
#> Cabin                             26.147793
#> Pclass                            20.569323
#> SibSp                             11.960032
#> Embarked                           9.807184
#> Parch                              8.916833

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8177682

XGBoost

xgb <- XGBTrainer$new(objective = "binary:logistic",
                      n_estimators = 500,
                      eval_metric = "auc",
                      maximize = TRUE,
                      learning_rate = 0.1,
                      max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-auc:0.871518  val-auc:0.862416 
#> [51] train-auc:0.969107  val-auc:0.882266 
#> [101]    train-auc:0.984336  val-auc:0.875811 
#> [151]    train-auc:0.990848  val-auc:0.872000 
#> [201]    train-auc:0.993482  val-auc:0.873820 
#> [251]    train-auc:0.995670  val-auc:0.873479 
#> [301]    train-auc:0.997154  val-auc:0.872455 
#> [351]    train-auc:0.997790  val-auc:0.871772 
#> [401]    train-auc:0.998214  val-auc:0.870208 
#> [451]    train-auc:0.998471  val-auc:0.869042 
#> [500]    train-auc:0.998605  val-auc:0.868644

pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8686441

Let’s create a new feature from the target variable using target encoding and test a model. Target encoding replaces each level of a categorical variable with a smoothed mean of the target for that level.

# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
                        test_df = xtest,
                        colname = "Embarked",
                        target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
                               test_df = xtest,
                               colname = "Embarked",
                               target = "Survived")$test[[2]]]

# train a random forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               71.025202
#> Fare                              51.960577
#> Age                               51.357592
#> Cabin                             25.915233
#> Pclass                            22.019871
#> SibSp                             12.146112
#> Parch                              8.873968
#> Embarked                           6.101673
#> feat_01                            5.737802

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8202423