Calculates target encodings using a smoothing parameter and count of categorical variables. This approach is more robust to possibility of leakage and avoid overfitting.

smoothMean(
  train_df,
  test_df,
  colname,
  target,
  min_samples_leaf = 1,
  smoothing = 1,
  noise_level = 0
)

Arguments

train_df

train dataset

test_df

test dataset

colname

name of categorical column

target

name of target column

min_samples_leaf

minimum samples to take category average into account

smoothing

smoothing effect to balance categorical average vs prior

noise_level

random noise to add, optional

Value

a train and test data table with mean encodings of the target for the given categorical variable

Examples

train <- data.frame(region=c('del','csk','rcb','del','csk','pune','guj','del'),
                    win = c(0,1,1,0,0,1,0,1))
test <- data.frame(region=c('rcb','csk','rcb','del','guj','pune','csk','kol'))

# calculate encodings
all_means <- smoothMean(train_df = train,
                         test_df = test,
                         colname = 'region',
                         target = 'win')
train_mean <- all_means$train
test_mean <- all_means$test