Calculates target (mean) encodings for a categorical variable using a smoothing parameter and the number of observations in each category. This approach is more robust to target leakage and helps avoid overfitting.
smoothMean(
  train_df,
  test_df,
  colname,
  target,
  min_samples_leaf = 1,
  smoothing = 1,
  noise_level = 0
)
train_df: train dataset
test_df: test dataset
colname: name of categorical column
target: name of target column
min_samples_leaf: minimum samples to take category average into account
smoothing: smoothing effect to balance categorical average vs prior
noise_level: random noise to add, optional
A list with elements train and test: the train and test data tables with mean encodings of the target for the given categorical variable.
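The role of min_samples_leaf and smoothing can be illustrated with a minimal sketch. It assumes one widely used formulation, in which a sigmoid weight on the category count blends the category mean with the global prior; whether smoothMean uses exactly this formula is an assumption, not something stated here. The helper name smooth_mean_sketch is hypothetical and not part of the package, and noise is left out.

```r
# Hypothetical helper, NOT the package's implementation: shows one common
# way to smooth a category mean towards the global prior.
smooth_mean_sketch <- function(x, y, min_samples_leaf = 1, smoothing = 1) {
  prior    <- mean(y)               # global target mean
  cat_mean <- tapply(y, x, mean)    # target mean per category
  cat_n    <- tapply(y, x, length)  # number of rows per category
  # Weight moves towards 1 as the count grows past min_samples_leaf;
  # a larger `smoothing` makes that transition more gradual.
  w <- 1 / (1 + exp(-(cat_n - min_samples_leaf) / smoothing))
  prior * (1 - w) + cat_mean * w
}

region <- c('del','csk','rcb','del','csk','pune','guj','del')
win    <- c(0,1,1,0,0,1,0,1)
enc    <- smooth_mean_sketch(region, win)
print(round(enc, 4))
```

Under this assumed formula, a category seen only once with the default settings gets weight 0.5, so its encoding sits halfway between its own mean and the prior, while frequent categories stay close to their own mean.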
train <- data.frame(region = c('del','csk','rcb','del','csk','pune','guj','del'),
                    win = c(0,1,1,0,0,1,0,1))
test <- data.frame(region = c('rcb','csk','rcb','del','guj','pune','csk','kol'))
# calculate encodings
all_means <- smoothMean(train_df = train,
                        test_df = test,
                        colname = 'region',
                        target = 'win')
train_mean <- all_means$train
test_mean <- all_means$test