R/TfidfVectorizer.R
TfIdfVectorizer.Rd
Creates a tf-idf matrix
Given a list of texts, it creates a sparse matrix consisting of tf-idf scores for tokens from the text.
superml::CountVectorizer -> TfIdfVectorizer
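A minimal end-to-end sketch (toy sentences; argument values are illustrative only):

library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)  # rows = sentences, columns = tokens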
sentences
a list containing sentences
max_df
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.
min_df
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.
max_features
Use the top features, sorted by count, in the bag-of-words matrix.
ngram_range
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split
splitting criterion for strings, default: " "
lowercase
convert all characters to lowercase before tokenizing
regex
regular expression to use for text cleaning.
remove_stopwords
a list of stopwords to use; by default, it uses its inbuilt list of standard stopwords
smooth_idf
logical, to prevent zero division, adds one to document frequencies, as if an extra document had been seen containing every term in the collection exactly once (see the sketch after this argument list)
norm
logical, if TRUE, each output row will have unit norm 'l2': the sum of squares of vector elements is 1; if FALSE, returns non-normalized vectors, default: TRUE
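The smooth_idf and norm conventions above can be illustrated with a short base-R sketch. The exact internal formula is an assumption here (the common sklearn-style idf = log((1 + n) / (1 + df)) + 1), not taken from the package source:

docs <- list(c("alone", "dark"), c("mother", "lot"))
vocab <- sort(unique(unlist(docs)))
n_docs <- length(docs)
# document frequency: number of documents containing each token
df <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))
# smooth_idf = TRUE: +1 in numerator and denominator, as if one extra
# document contained every term exactly once (assumed convention)
idf <- log((1 + n_docs) / (1 + df)) + 1
# raw term counts per document (rows = documents, columns = vocab)
tf <- t(sapply(docs, function(d) sapply(vocab, function(w) sum(d == w))))
tfidf <- sweep(tf, 2, idf, `*`)
# norm = TRUE: scale each row to unit L2 norm
tfidf <- tfidf / sqrt(rowSums(tfidf^2))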
new()
TfIdfVectorizer$new(
min_df,
max_df,
max_features,
ngram_range,
regex,
remove_stopwords,
split,
lowercase,
smooth_idf,
norm
)
min_df
numeric, When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.
max_df
numeric, When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.
max_features
integer, Build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
ngram_range
vector, The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams (see the construction example below).
regex
character, regular expression to use for text cleaning.
remove_stopwords
list, a list of stopwords to use; by default, it uses its inbuilt list of standard English stopwords
split
character, splitting criterion for strings, default: " "
lowercase
logical, convert all characters to lowercase before tokenizing, default: TRUE
smooth_idf
logical, to prevent zero division, adds one to document frequencies, as if an extra document had been seen containing every term in the collection exactly once
norm
logical, if TRUE, each output row will have unit norm 'l2': the sum of squares of vector elements is 1; if FALSE, returns non-normalized vectors, default: TRUE
parallel
logical, speeds up n-gram computation using n - 1 cores, default: TRUE
TfIdfVectorizer$new()
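A hedged construction example using several of the arguments documented above (values are illustrative, not recommended defaults):

tf <- TfIdfVectorizer$new(
  min_df = 0.1,           # ignore tokens in fewer than 10% of documents
  max_df = 0.9,           # ignore tokens in more than 90% of documents
  ngram_range = c(1, 2),  # extract unigrams and bigrams
  lowercase = TRUE,
  smooth_idf = TRUE,
  norm = TRUE
)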
fit()
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)
transform()
## ------------------------------------------------
## Method `TfIdfVectorizer$new`
## ------------------------------------------------
TfIdfVectorizer$new()
#> <TfIdfVectorizer>
#> Inherits from: <CountVectorizer>
#> Public:
#> clone: function (deep = FALSE)
#> fit: function (sentences)
#> fit_transform: function (sentences)
#> initialize: function (min_df, max_df, max_features, ngram_range, regex, remove_stopwords,
#> lowercase: TRUE
#> max_df: 1
#> max_features: NULL
#> min_df: 1
#> model: NULL
#> ngram_range: 1 1
#> norm: TRUE
#> regex: [^a-zA-Z0-9 ]
#> remove_stopwords: TRUE
#> sentences: NA
#> smooth_idf: TRUE
#> split:
#> transform: function (sentences)
#> Private:
#> check_args: function (x, max_value, what)
#> get_bow_df: function (sentences, use_tokens = NULL)
#> get_tokens: function (sentences, min_df = 1, max_df = 1, ngram_range = NULL,
#> gettfmatrix: function (countmatrix, sentences, norm, smooth_idf = TRUE)
#> preprocess: function (sentences, regex = "[^0-9a-zA-Z ]", lowercase, remove_stopwords)
## ------------------------------------------------
## Method `TfIdfVectorizer$fit`
## ------------------------------------------------
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)
## ------------------------------------------------
## Method `TfIdfVectorizer$fit_transform`
## ------------------------------------------------
if (FALSE) {
sents <- c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
}
## ------------------------------------------------
## Method `TfIdfVectorizer$transform`
## ------------------------------------------------
if (FALSE) {
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night", 'mothers day')
tf <- TfIdfVectorizer$new(min_df = 0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
}
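As a usage note, fit_transform() is assumed to follow the standard vectorizer convention of fit() followed by transform() on the same sentences, so both workflows below should yield matrices of the same shape:

sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
tf1 <- TfIdfVectorizer$new(min_df = 0.1)
m1 <- tf1$fit_transform(sents)

tf2 <- TfIdfVectorizer$new(min_df = 0.1)
tf2$fit(sents)
m2 <- tf2$transform(sents)
identical(dim(m1), dim(m2))  # expected: TRUE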