Count Vectorizer

Creates a CountVectorizer model.

Details

Given a list of text, it builds a bag-of-words model and returns a sparse matrix of token counts, with one row per sentence and one column per vocabulary term.
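To make the bag-of-words idea concrete, here is a minimal base R sketch. It is not the class's implementation: it skips regex cleaning and stopword removal, uses punctuation-free sentences, and builds a dense matrix rather than a sparse one. The names `tokens`, `vocab`, and `counts` are illustrative only.

```r
# Toy corpus adapted from the examples on this page (punctuation removed)
sents <- c("i am alone in dark", "alone in the dark", "many mothers in the lot")

# Lowercase and split on a single space, mirroring the documented defaults
tokens <- strsplit(tolower(sents), " ", fixed = TRUE)

# Vocabulary: sorted unique tokens across the whole corpus
vocab <- sort(unique(unlist(tokens)))

# One row per sentence, one column per vocabulary term
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
counts
```

Each cell holds how often a term occurs in a sentence; most cells are zero, which is why a sparse representation suits real corpora.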

Public fields

sentences: a list containing sentences
max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_features: Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: splitting criteria for strings, default: " "
lowercase: convert all characters to lowercase before tokenizing
regex: regular expression to use for text cleaning.
remove_stopwords: a list of stopwords to use; by default, the built-in list of standard English stopwords is used
model: internal attribute which stores the count model
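The document-frequency thresholds can be illustrated with a short base R sketch. It assumes document frequency means the share of sentences that contain a term at least once, and it follows the "strictly higher" and "strictly lower" wording above. The threshold values and variable names are chosen only for this illustration, and the class itself may handle details such as stopwords differently.

```r
# Toy corpus adapted from the examples on this page (punctuation removed)
sents <- c("i am alone in dark", "mother_mary a lot",
           "alone in the dark", "many mothers in the lot")
tokens <- strsplit(sents, " ", fixed = TRUE)

# Document frequency: share of sentences containing each term at least once
doc_terms <- unlist(lapply(tokens, unique))
doc_freq <- table(doc_terms) / length(sents)

# Hypothetical thresholds, chosen only for this illustration
min_df <- 0.3
max_df <- 0.7

# Drop terms strictly below min_df or strictly above max_df
keep <- as.vector(doc_freq >= min_df & doc_freq <= max_df)
kept <- names(doc_freq)[keep]
kept
```

Here "in" appears in three of four sentences (0.75) and is dropped by max_df, while terms seen in only one sentence (0.25) are dropped by min_df.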

Methods

Method `new()`

Usage

CountVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase
)

Arguments

min_df: numeric, When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold; the value lies between 0 and 1.
max_df: numeric, When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold; the value lies between 0 and 1.
max_features: integer, Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: vector, The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character, regular expression to use for text cleaning.
remove_stopwords: list, a list of stopwords to use; by default, the built-in list of standard English stopwords is used
split: character, splitting criteria for strings, default: " "
lowercase: logical, convert all characters to lowercase before tokenizing, default: TRUE
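The effect of ngram_range can be sketched in base R as follows. The helper `word_ngrams()` is hypothetical and not part of the documented API, and joining the words of an n-gram with a single space is an assumption made for readability.

```r
# Hypothetical helper for illustration; not part of the documented API
word_ngrams <- function(tokens, ngram_range = c(1, 2)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip n-values longer than the sentence itself
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- strsplit("alone in the dark", " ", fixed = TRUE)[[1]]
unigrams <- word_ngrams(tokens, c(1, 1))  # unigrams only
uni_bi   <- word_ngrams(tokens, c(1, 2))  # unigrams and bigrams
bigrams  <- word_ngrams(tokens, c(2, 2))  # bigrams only
bigrams
```

For a four-word sentence this yields four unigrams and three bigrams, so c(1, 2) produces seven candidate terms.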

Details

Create a new `CountVectorizer` object.

Returns

A `CountVectorizer` object.

Examples

cv = CountVectorizer$new(min_df=0.1)

Method `fit()`

Usage

CountVectorizer$fit(sentences)

Arguments

sentences: a list of text sentences

Details

Fits the CountVectorizer model on the given sentences.

Returns

NULL

Examples

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)

Method `fit_transform()`

Usage

CountVectorizer$fit_transform(sentences)

Arguments

sentences: a list of text sentences

Details

Fits the CountVectorizer model and returns a sparse matrix of token counts.

Returns

A sparse matrix containing the count of tokens in each given sentence.

Examples

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)

Method `transform()`

Usage

CountVectorizer$transform(sentences)

Arguments

sentences: a list of new text sentences

Details

Transforms new sentences into a sparse matrix of token counts, using the model fitted earlier.

Returns

A sparse matrix containing the count of tokens in each given sentence.

Examples

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
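The split between fitting and transforming can be sketched in base R: the vocabulary is learned once from the training sentences and then reused to count tokens in new text. That tokens absent from the fitted vocabulary are ignored is an assumption based on the usual bag-of-words convention; this page does not state it. `count_with_vocab()` is a hypothetical helper, not part of the class.

```r
train <- c("alone in the dark", "many mothers in the lot")
new_sents <- c("dark at night", "mothers day")

# "Fit": learn the vocabulary from the training sentences only
vocab <- sort(unique(unlist(strsplit(train, " ", fixed = TRUE))))

# "Transform": count only terms already present in the learned vocabulary
count_with_vocab <- function(sentences, vocab) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  m <- t(vapply(
    tokens,
    # Tokens outside `vocab` become NA in the factor and are not counted
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
  ))
  colnames(m) <- vocab
  m
}

new_counts <- count_with_vocab(new_sents, vocab)
new_counts
```

The result has the same columns as the training vocabulary, so matrices produced from training and new data line up column by column.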

Method `clone()`

The objects of this class are cloneable with this method.

Usage

CountVectorizer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

