vignettes/Guide-to-TfidfVectorizer.Rmd
In this tutorial, we’ll look at how to create a tfidf feature matrix in R in two simple steps with superml. Superml gains speed from parallel computation and from the optimised functions of the data.table R package. A tfidf matrix can be used as features for a machine learning model. We can also use tfidf features as an embedding to represent the given texts.
You can install the latest CRAN version (recommended) using:
install.packages("superml")
You can install the development version directly from GitHub using:
devtools::install_github("saraswatmks/superml")
For machine learning, superml builds on existing R packages. Hence, installing superml does not install all of those dependencies by default. However, when you train a model, superml will automatically install the required package if it’s not found. If you’d rather install all dependencies at once, you can simply do:
install.packages("superml", dependencies=TRUE)
First, we’ll create some sample data. Feel free to run the code alongside on your laptop and check the results.
library(superml)
#> Loading required package: R6
# should be a vector of texts
sents <- c('i am going home and home',
'where are you going.? //// ',
'how does it work',
'transform your work and go work again',
'home is where you go from to work')
# generate more sentences
n <- 10
sents <- rep(sents, n)
length(sents)
#> [1] 50
As a sample, we’ve generated 50 documents. Let’s create the features now. For ease of use, superml follows an API layout similar to Python’s scikit-learn.
# initialise the class
tfv <- TfIdfVectorizer$new(max_features = 10, remove_stopwords = FALSE)
# generate the matrix
tf_mat <- tfv$fit_transform(sents)
head(tf_mat, 3)
#> work home you where going go and your transform
#> [1,] 0 0.8164966 0.0000000 0.0000000 0.4082483 0 0.4082483 0 0
#> [2,] 0 0.0000000 0.5773503 0.5773503 0.5773503 0 0.0000000 0 0
#> [3,] 1 0.0000000 0.0000000 0.0000000 0.0000000 0 0.0000000 0 0
#> to
#> [1,] 0
#> [2,] 0
#> [3,] 0
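To make the numbers in this output less opaque, here is a minimal base R sketch of the computation for the first three rows. It assumes the scikit-learn style smoothed idf, log((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. That assumption reproduces the values printed above, but it is not taken from the superml source. The helper name tokenize and the hard-coded vocabulary are for illustration only.

```r
# The five unique sentences from the sample data above
docs <- c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')

# Hypothetical tokenizer: lower-case, drop non-letters, split on spaces
tokenize <- function(x) {
  x <- gsub("[^a-z ]", "", tolower(x))
  lapply(strsplit(trimws(x), " +"), function(tok) tok[nzchar(tok)])
}
tokens <- tokenize(docs)

# Vocabulary copied from the column names printed above
vocab <- c("work", "home", "you", "where", "going", "go", "and",
           "your", "transform", "to")

# Term counts: one row per document, one column per vocabulary token
counts <- t(vapply(tokens, function(tok) {
  idx <- match(tok, vocab)
  tabulate(idx[!is.na(idx)], nbins = length(vocab))
}, integer(length(vocab))))
colnames(counts) <- vocab

# Assumed smoothed idf: log((1 + n) / (1 + df)) + 1
n   <- length(docs)
df  <- colSums(counts > 0)
idf <- log((1 + n) / (1 + df)) + 1

# Weight counts by idf, then scale each row to unit Euclidean length
tfidf <- sweep(counts, 2, idf, `*`)
tfidf <- tfidf / sqrt(rowSums(tfidf^2))
round(tfidf[1:3, ], 7)
```

In the first row, home occurs twice while going and and occur once each, and all three tokens have the same document frequency, so the normalised weights are 2/sqrt(6) and 1/sqrt(6), the 0.8164966 and 0.4082483 shown above.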
A few observations:
- remove_stopwords = FALSE: the default is TRUE. We set it to FALSE since most of the words in our dummy sents are stopwords.
- max_features = 10: selects the top 10 features (tokens) based on frequency.
- norm = TRUE: set by default, which is why each row above has unit length.
Now, let’s generate the matrix using the ngram_range parameter.
# initialise the class
tfv <- TfIdfVectorizer$new(min_df = 0.4, remove_stopwords = FALSE, ngram_range = c(1, 3))
# generate the matrix
tf_mat <- tfv$fit_transform(sents)
head(tf_mat, 3)
#> work home you where going go and
#> [1,] 0 0.8164966 0.0000000 0.0000000 0.4082483 0 0.4082483
#> [2,] 0 0.0000000 0.5773503 0.5773503 0.5773503 0 0.0000000
#> [3,] 1 0.0000000 0.0000000 0.0000000 0.0000000 0 0.0000000
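The way ngram_range and min_df interact in this output can be sketched in base R. The helpers tokenize and make_ngrams below are hypothetical and only illustrate the idea; superml’s internal tokenisation may differ. The sketch uses the five unique sentences, since repeating them ten times does not change the share of documents that contain a token.

```r
docs <- c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')

# Hypothetical tokenizer: lower-case, drop non-letters, split on spaces
tokenize <- function(x) {
  x <- gsub("[^a-z ]", "", tolower(x))
  lapply(strsplit(trimws(x), " +"), function(tok) tok[nzchar(tok)])
}

# Hypothetical helper: all n-grams of one token vector for n in min_n..max_n
make_ngrams <- function(tok, min_n = 1, max_n = 3) {
  out <- character(0)
  for (n in min_n:max_n) {
    if (length(tok) < n) next
    starts <- seq_len(length(tok) - n + 1)
    out <- c(out, vapply(starts,
                         function(i) paste(tok[i:(i + n - 1)], collapse = " "),
                         character(1)))
  }
  out
}

grams <- lapply(tokenize(docs), make_ngrams, min_n = 1, max_n = 3)
grams[[3]]  # unigrams, bigrams and trigrams of 'how does it work'

# Share of documents that contain each n-gram
doc_share <- table(unlist(lapply(grams, unique))) / length(docs)

# Keep the n-grams present in at least 40% of documents (min_df = 0.4)
kept <- names(doc_share)[doc_share >= 0.4]
kept
```

No bigram or trigram is shared between two different sentences here, so only seven unigrams pass the 40% threshold. This is consistent with the seven columns printed above.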
A few observations:
- ngram_range = c(1, 3): sets the lower and upper bounds, respectively, on the length of the resulting ngram tokens.
- min_df = 0.4: keeps only the tokens that occur in at least 40% of the documents.
When using TfIdfVectorizer for a machine learning model, it can be confusing which of the methods fit_transform, fit and transform should be used to generate tfidf features for the given data. Here’s one way to do it:
library(data.table)
library(superml)
# sents from above, with 'how does it work' repeated once
sents <- c('i am going home and home',
'where are you going.? //// ',
'how does it work',
'transform your work and go work again',
'home is where you go from to work',
'how does it work')
# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))
Let’s see what the data looks like:
head(train, 3)
#> text target
#> 1: i am going home and home 0
#> 2: where are you going.? //// 1
#> 3: how does it work 0
head(test, 3)
#> text target
#> 1: i am going home and home 0
#> 2: how does it work 1
#> 3: how does it work 0
Now, we generate features for the train and test data:
# initialise the class
tfv <- TfIdfVectorizer$new(min_df = 0.3, remove_stopwords = FALSE, ngram_range = c(1,3))
# we fit on train data
tfv$fit(train$text)
train_tf_features <- tfv$transform(train$text)
test_tf_features <- tfv$transform(test$text)
dim(train_tf_features)
#> [1] 6 15
dim(test_tf_features)
#> [1] 6 15
We get 15 features for each dataset. Let’s see how they look:
head(train_tf_features, 3)
#> work home you where it work it how does it
#> [1,] 0.0000000 0.8164966 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 0.0000000 0.5773503 0.5773503 0.0000000 0.0000000 0.0000000
#> [3,] 0.2478085 0.0000000 0.0000000 0.0000000 0.3425257 0.3425257 0.3425257
#> how does how going go does it work does it does
#> [1,] 0.0000000 0.0000000 0.4082483 0 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 0.0000000 0.5773503 0 0.0000000 0.0000000 0.0000000
#> [3,] 0.3425257 0.3425257 0.0000000 0 0.3425257 0.3425257 0.3425257
#> and
#> [1,] 0.4082483
#> [2,] 0.0000000
#> [3,] 0.0000000
head(test_tf_features, 3)
#> work home you where it work it how does it how does
#> [1,] 0.0000000 0.8164966 0 0 0.0000000 0.0000000 0.0000000 0.0000000
#> [2,] 0.2478085 0.0000000 0 0 0.3425257 0.3425257 0.3425257 0.3425257
#> [3,] 0.2478085 0.0000000 0 0 0.3425257 0.3425257 0.3425257 0.3425257
#> how going go does it work does it does and
#> [1,] 0.0000000 0.4082483 0 0.0000000 0.0000000 0.0000000 0.4082483
#> [2,] 0.3425257 0.0000000 0 0.3425257 0.3425257 0.3425257 0.0000000
#> [3,] 0.3425257 0.0000000 0 0.3425257 0.3425257 0.3425257 0.0000000
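The reason for calling fit on the training text and then transform on both sets is that the vocabulary and idf weights should be learned once, from the training data only, and then reused unchanged. The base R sketch below illustrates that separation. The functions ngrams, fit_tfidf and transform_tfidf are hypothetical, and the smoothed idf formula log((1 + n) / (1 + df)) + 1 is an assumption borrowed from scikit-learn, not taken from the superml source.

```r
# Training sentences from the example above
train_text <- c('i am going home and home',
                'where are you going.? //// ',
                'how does it work',
                'transform your work and go work again',
                'home is where you go from to work',
                'how does it work')

# Hypothetical helper: lower-case, strip non-letters, build 1- to max_n-grams
ngrams <- function(x, max_n = 3) {
  tok <- strsplit(trimws(gsub("[^a-z ]", "", tolower(x))), " +")[[1]]
  tok <- tok[nzchar(tok)]
  unlist(lapply(seq_len(min(max_n, length(tok))), function(n) {
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# "fit": learn the vocabulary and idf weights from the training text only
fit_tfidf <- function(text, min_df = 0.3) {
  grams <- lapply(text, ngrams)
  df <- table(unlist(lapply(grams, unique)))
  df <- df[df / length(text) >= min_df]
  list(vocab = names(df),
       idf = log((1 + length(text)) / (1 + as.numeric(df))) + 1)
}

# "transform": apply the stored vocabulary and idf to any text
transform_tfidf <- function(model, text) {
  m <- t(vapply(text, function(x) {
    idx <- match(ngrams(x), model$vocab)
    tabulate(idx[!is.na(idx)], nbins = length(model$vocab)) * model$idf
  }, numeric(length(model$vocab)), USE.NAMES = FALSE))
  colnames(m) <- model$vocab
  norms <- sqrt(rowSums(m^2))
  m / ifelse(norms == 0, 1, norms)  # avoid dividing by zero for empty rows
}

model   <- fit_tfidf(train_text, min_df = 0.3)
train_m <- transform_tfidf(model, train_text)

# A made-up unseen sentence: n-grams outside the learned vocabulary are dropped
new_m <- transform_tfidf(model, "does it work from home")

dim(train_m)
round(train_m[3, c("work", "how")], 7)
colnames(new_m)[new_m[1, ] != 0]
```

Under these assumptions the sketch yields the same 6 x 15 shape as above, and the third training row reproduces the 0.2478085 and 0.3425257 weights. The unseen sentence still gets exactly 15 columns, because transform never adds to the vocabulary learned during fit.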
Finally, to train a machine learning model on this, you can simply do:
# ensure the input to classifier is a data.table or data.frame object
x_train <- data.table(cbind(train_tf_features, target = train$target))
x_test <- data.table(test_tf_features)
xgb <- XGBTrainer$new(n_estimators = 10, objective = "binary:logistic")
xgb$fit(x_train, "target")
#> converting the data into xgboost format..
#> starting with training...
#> [10:48:48] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-logloss:0.693147
#> Will train until train_logloss hasn't improved in 50 rounds.
#>
#> [10] train-logloss:0.693147
predictions <- xgb$predict(x_test)
predictions
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5
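The training log above reports a log-loss of 0.693147 at both round 1 and round 10, and every prediction is 0.5. A quick arithmetic check shows that these two observations agree: 0.693147 is log(2), the log-loss obtained by predicting 0.5 for every row. This suggests the model has learned nothing from the six-row dummy data, so treat the example as a demonstration of the workflow rather than of model quality. The variable names below are illustrative.

```r
y <- rep(c(0, 1), 3)      # training targets from the example above
p <- rep(0.5, length(y))  # constant predictions, as printed above

# Binary log-loss: -mean(y * log(p) + (1 - y) * log(1 - p))
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
round(logloss, 6)
all.equal(logloss, log(2))
```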