Creates CountVectorizer Model.
Given a list of texts, it builds a bag-of-words model and returns a sparse matrix of token counts.
sentences: a list containing sentences
max_df: When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
min_df: When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_features: Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: The lower and upper boundaries of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: splitting criterion for strings, default: " "
lowercase: convert all characters to lowercase before tokenizing
regex: regular expression to use for text cleaning
remove_stopwords: a list of stopwords to use; by default it uses its built-in list of standard stopwords
model: internal attribute that stores the count model
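To make the min_df and max_df thresholds concrete, the following is a minimal base R sketch of document-frequency filtering. It is an illustration only and not the package's internal implementation; the threshold values 0.3 and 0.7 and the variable names are assumptions chosen for the example.

```r
# Illustrative base R sketch; not superml's internal implementation.
sents <- c("i am alone in dark", "mother_mary a lot",
           "alone in the dark", "many mothers in the lot")

# Tokenize on a single space and keep each token once per sentence.
tokens_per_doc <- lapply(strsplit(tolower(sents), " ", fixed = TRUE), unique)

# Document frequency: the fraction of sentences that contain each token.
doc_freq <- table(unlist(tokens_per_doc)) / length(sents)
df_values <- as.vector(doc_freq)

# Hypothetical thresholds, interpreted as proportions between 0 and 1.
min_df <- 0.3
max_df <- 0.7

# Drop terms strictly below min_df or strictly above max_df.
keep <- df_values >= min_df & df_values <= max_df
vocabulary <- names(doc_freq)[keep]
print(vocabulary)
```

With these thresholds, "in" is dropped because it appears in three of the four sentences (0.75 > 0.7), and every token that appears in only one sentence (0.25 < 0.3) is dropped as well.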
new()
CountVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase
)
min_df: numeric. When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_df: numeric. When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
max_features: integer. Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: vector. The lower and upper boundaries of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character. Regular expression to use for text cleaning.
remove_stopwords: list. A list of stopwords to use; by default it uses its built-in list of standard English stopwords.
split: character. Splitting criterion for strings, default: " "
lowercase: logical. Convert all characters to lowercase before tokenizing, default: TRUE
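The effect of ngram_range on word n-grams can be sketched in base R as follows. The helper word_ngrams() is hypothetical and written only for this illustration; it is not part of the package, and the package may build its n-grams differently.

```r
# Hypothetical helper for illustration; not part of superml.
word_ngrams <- function(sentence, ngram_range = c(1, 1), split = " ") {
  words <- strsplit(sentence, split, fixed = TRUE)[[1]]
  out <- character(0)
  # Use every n with min_n <= n <= max_n.
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # A sentence shorter than n words has no n-grams of that size.
    if (n > length(words)) next
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

unigrams <- word_ngrams("alone in the dark", c(1, 1))  # unigrams only
uni_bi   <- word_ngrams("alone in the dark", c(1, 2))  # unigrams and bigrams
bigrams  <- word_ngrams("alone in the dark", c(2, 2))  # bigrams only
print(bigrams)
```

For the four-word sentence above, c(1, 1) yields four unigrams, c(2, 2) yields three bigrams, and c(1, 2) yields all seven.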
cv = CountVectorizer$new(min_df=0.1)
fit()
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
fit_transform()
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
transform()
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
## ------------------------------------------------
## Method `CountVectorizer$new`
## ------------------------------------------------
cv = CountVectorizer$new(min_df=0.1)
## ------------------------------------------------
## Method `CountVectorizer$fit`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
## ------------------------------------------------
## Method `CountVectorizer$fit_transform`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
## ------------------------------------------------
## Method `CountVectorizer$transform`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
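The fit-then-transform pattern in the examples above can be approximated in base R to show what the resulting count matrix represents. This is a sketch under stated assumptions, not the package's implementation: tokenize() and count_matrix() are hypothetical helpers, the result is a dense integer matrix for readability, and it assumes that tokens absent from the fitted vocabulary are ignored during transform.

```r
# Illustrative base R sketch; superml's internals may differ.
tokenize <- function(x) strsplit(tolower(x), " ", fixed = TRUE)

# "Fit": learn a sorted vocabulary from the training sentences.
sents <- c("i am alone in dark", "mother_mary a lot",
           "alone in the dark", "many mothers in the lot")
vocabulary <- sort(unique(unlist(tokenize(sents))))

# "Transform": count each vocabulary term in every new sentence.
# Tokens outside the vocabulary become NA in the factor and are not counted.
count_matrix <- function(new_sents, vocabulary) {
  counts <- t(vapply(
    tokenize(new_sents),
    function(tokens) as.integer(table(factor(tokens, levels = vocabulary))),
    integer(length(vocabulary))
  ))
  dimnames(counts) <- list(NULL, vocabulary)
  counts
}

new_sents <- c("dark at night", "mothers day")
m <- count_matrix(new_sents, vocabulary)
print(m[, c("dark", "mothers")])
```

Each row corresponds to one new sentence and each column to one vocabulary term. In this sketch "at", "night", and "day" contribute nothing because they were never seen during the fit step.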