Source: R/TfidfVectorizer.R
TfIdfVectorizer: Creates a tf-idf matrix
Given a list of texts, it creates a sparse matrix consisting of tf-idf scores for tokens from the text.
Super class: superml::CountVectorizer -> TfIdfVectorizer
Public fields:

sentences: a list containing sentences.
max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_features: use the top features, sorted by count, in the bag-of-words matrix.
ngram_range: the lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: splitting criterion for strings; default: " ".
lowercase: convert all characters to lowercase before tokenizing.
regex: regex expression to use for text cleaning.
remove_stopwords: a list of stopwords to use; by default it uses its inbuilt list of standard stopwords.
smooth_idf: logical; to prevent zero division, adds one to document frequencies, as if an extra document were seen containing every term in the collection exactly once.
norm: logical; if TRUE, each output row will have unit 'l2' norm (the sum of squares of the vector elements is 1); if FALSE, returns non-normalized vectors. Default: TRUE.
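As a quick illustration of how these fields interact, the sketch below builds a vectorizer that keeps only bigrams occurring in at least 30% of the documents (the parameter values are illustrative, not defaults):

library(superml)

# bigrams only, dropping terms seen in fewer than 30% of documents
tfv <- TfIdfVectorizer$new(min_df = 0.3, ngram_range = c(2, 2))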
Method new()

Usage:

TfIdfVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase,
  smooth_idf,
  norm
)

Arguments:

min_df: numeric; when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_df: numeric; when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
max_features: integer; build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
ngram_range: vector; the lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character; regex expression to use for text cleaning.
remove_stopwords: list; a list of stopwords to use; by default it uses its inbuilt list of standard English stopwords.
split: character; splitting criterion for strings; default: " ".
lowercase: logical; convert all characters to lowercase before tokenizing; default: TRUE.
smooth_idf: logical; to prevent zero division, adds one to document frequencies, as if an extra document were seen containing every term in the collection exactly once.
norm: logical; if TRUE, each output row will have unit 'l2' norm (the sum of squares of the vector elements is 1); if FALSE, returns non-normalized vectors. Default: TRUE.
parallel: logical; speeds up n-gram computation using n-1 cores; default: TRUE.
Examples:

TfIdfVectorizer$new()

Method fit()

Usage:

TfIdfVectorizer$fit(sentences)

Examples:

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)

Method transform()

Usage:

TfIdfVectorizer$transform(sentences)
## ------------------------------------------------
## Method `TfIdfVectorizer$new`
## ------------------------------------------------
TfIdfVectorizer$new()
#> <TfIdfVectorizer>
#> Inherits from: <CountVectorizer>
#> Public:
#> clone: function (deep = FALSE)
#> fit: function (sentences)
#> fit_transform: function (sentences)
#> initialize: function (min_df, max_df, max_features, ngram_range, regex, remove_stopwords,
#> lowercase: TRUE
#> max_df: 1
#> max_features: NULL
#> min_df: 1
#> model: NULL
#> ngram_range: 1 1
#> norm: TRUE
#> regex: [^a-zA-Z0-9 ]
#> remove_stopwords: TRUE
#> sentences: NA
#> smooth_idf: TRUE
#> split:
#> transform: function (sentences)
#> Private:
#> check_args: function (x, max_value, what)
#> get_bow_df: function (sentences, use_tokens = NULL)
#> get_tokens: function (sentences, min_df = 1, max_df = 1, ngram_range = NULL,
#> gettfmatrix: function (countmatrix, sentences, norm, smooth_idf = TRUE)
#> preprocess: function (sentences, regex = "[^0-9a-zA-Z ]", lowercase, remove_stopwords)
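The printout above shows the default field values. Arguments passed to new() override the corresponding public fields, which can be checked directly on the instance (a minimal sketch):

tf <- TfIdfVectorizer$new(min_df = 0.3, norm = FALSE)
tf$min_df  # 0.3, overriding the default of 1
tf$norm    # FALSE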
## ------------------------------------------------
## Method `TfIdfVectorizer$fit`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)
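fit() learns the vocabulary and document frequencies from sents; scores for any sentences are then produced with transform(). Transforming the training sentences themselves should match the one-step fit_transform() (a sketch, continuing from the fit above):

tf_matrix <- tf$transform(sents)
# equivalently, in one step on the training data:
# tf_matrix <- tf$fit_transform(sents)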
## ------------------------------------------------
## Method `TfIdfVectorizer$fit_transform`
## ------------------------------------------------
if (FALSE) {
sents <- c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
}
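The result is a documents-by-tokens matrix of tf-idf scores. Assuming that layout, a quick inspection sketch:

dim(tf_matrix)       # one row per sentence, one column per retained token
colnames(tf_matrix)  # the tokens kept after applying min_df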
## ------------------------------------------------
## Method `TfIdfVectorizer$transform`
## ------------------------------------------------
if (FALSE) {
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
tf = TfIdfVectorizer$new(min_df=0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
}
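For reference, the smoothed idf and 'l2' normalization described in the arguments can be reproduced by hand. The sketch below assumes the common convention idf = ln((1 + N) / (1 + df)) + 1 for smooth_idf = TRUE; superml's exact formula may differ in detail:

# toy counts: 2 documents x 3 terms
counts <- matrix(c(1, 0, 2,
                   0, 1, 1), nrow = 2, byrow = TRUE)
N   <- nrow(counts)
df  <- colSums(counts > 0)               # document frequency of each term
idf <- log((1 + N) / (1 + df)) + 1       # smoothed idf (assumed convention)
tfidf <- sweep(counts, 2, idf, `*`)      # scale each column by its idf
tfidf <- tfidf / sqrt(rowSums(tfidf^2))  # l2-normalize each row
rowSums(tfidf^2)                         # both rows equal 1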