Creates a tf-idf matrix

Details

Given a list of texts, it creates a sparse matrix of tf-idf scores for the tokens in the text.
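
For example, a typical end-to-end call mirrors the method examples further below (the variable names here are illustrative):

sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)  # sparse matrix of tf-idf scores, one row per sentence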

Super class

superml::CountVectorizer -> TfIdfVectorizer

Public fields

sentences

a list containing sentences

max_df

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.

min_df

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.

max_features

use only the top features, sorted by count, when building the bag-of-words matrix.

ngram_range

The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.

split

splitting criterion for strings, default: " "

lowercase

convert all characters to lowercase before tokenizing

regex

regular expression to use for text cleaning.

remove_stopwords

a list of stopwords to use; by default it uses its inbuilt list of standard stopwords

smooth_idf

logical; to prevent division by zero, adds one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen (see the sketch after this field list)

norm

logical; if TRUE, each output row will have unit l2 norm (the sum of squares of its elements is 1). If FALSE, returns non-normalized vectors. Default: TRUE
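
The sketch below illustrates what smooth_idf and norm mean numerically. It is a minimal illustration assuming the conventional smoothed-idf formulation and l2 normalization; the variable names are hypothetical and the package's internal gettfmatrix() helper may differ in detail.

# hypothetical illustration of smooth_idf and norm, not the superml internals
n_docs   <- 4                                  # documents in the corpus
doc_freq <- c(dark = 2, alone = 2, lot = 2)    # documents containing each term
# smooth_idf: add one to numerator and denominator, as if one extra document
# containing every term exactly once had been seen
idf <- log((1 + n_docs) / (1 + doc_freq)) + 1
tf  <- c(dark = 1, alone = 1, lot = 0)         # term counts for one document
tfidf <- tf * idf
# norm = TRUE: scale the row so the sum of squares of its elements is 1 (l2)
tfidf / sqrt(sum(tfidf^2))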

Methods


Method new()

Usage

TfIdfVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase,
  smooth_idf,
  norm
)

Arguments

min_df

numeric. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.

max_df

numeric. When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.

max_features

integer. Build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

ngram_range

vector. The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams (see the sketch after this argument list).

regex

character, regular expression to use for text cleaning.

remove_stopwords

list, a list of stopwords to use; by default it uses its inbuilt list of standard English stopwords

split

character, splitting criterion for strings, default: " "

lowercase

logical, convert all characters to lowercase before tokenizing, default: TRUE

smooth_idf

logical; to prevent division by zero, adds one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen

norm

logical; if TRUE, each output row will have unit l2 norm (the sum of squares of its elements is 1). If FALSE, returns non-normalized vectors. Default: TRUE

parallel

logical, speeds up n-gram computation using n-1 cores, default: TRUE
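
To make the ngram_range argument concrete, the hypothetical helper below enumerates word n-grams for a given range; it is only an illustration of what c(1, 2) extracts, not the function superml uses internally.

# hypothetical helper, for illustration only
word_ngrams <- function(sentence, min_n = 1, max_n = 2, split = " ") {
  tokens <- unlist(strsplit(sentence, split, fixed = TRUE))
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    if (length(tokens) >= n) {
      for (i in seq_len(length(tokens) - n + 1)) {
        out <- c(out, paste(tokens[i:(i + n - 1)], collapse = " "))
      }
    }
  }
  out
}
word_ngrams("alone in the dark")
# "alone" "in" "the" "dark" "alone in" "in the" "the dark"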

Details

Create a new `TfIdfVectorizer` object.

Returns

A `TfIdfVectorizer` object.

Examples


Method fit()

Usage

TfIdfVectorizer$fit(sentences)

Arguments

sentences

a list of text sentences

Details

Fits the TfIdfVectorizer model on sentences

Returns

NULL

Examples

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)


Method fit_transform()

Usage

TfIdfVectorizer$fit_transform(sentences)

Arguments

sentences

a list of text sentences

Details

Fits the TfIdfVectorizer model on sentences and returns a sparse matrix of tf-idf scores for the tokens

Returns

a sparse matrix containing tf-idf score for tokens in each given sentence

Examples

\dontrun{
sents <- c('i am alone in dark.','mother_mary a lot',
         'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
}


Method transform()

Usage

TfIdfVectorizer$transform(sentences)

Arguments

sentences

a list of new text sentences

Details

Returns a sparse matrix of tf-idf scores for tokens in the new sentences, using the vocabulary learned during fit

Returns

a sparse matrix containing tf-idf score for tokens in each given sentence

Examples

\dontrun{
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
tf = TfIdfVectorizer$new(min_df=0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
}


Method clone()

The objects of this class are cloneable with this method.

Usage

TfIdfVectorizer$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples


## ------------------------------------------------
## Method `TfIdfVectorizer$new`
## ------------------------------------------------

TfIdfVectorizer$new()
#> <TfIdfVectorizer>
#>   Inherits from: <CountVectorizer>
#>   Public:
#>     clone: function (deep = FALSE) 
#>     fit: function (sentences) 
#>     fit_transform: function (sentences) 
#>     initialize: function (min_df, max_df, max_features, ngram_range, regex, remove_stopwords, 
#>     lowercase: TRUE
#>     max_df: 1
#>     max_features: NULL
#>     min_df: 1
#>     model: NULL
#>     ngram_range: 1 1
#>     norm: TRUE
#>     regex: [^a-zA-Z0-9 ]
#>     remove_stopwords: TRUE
#>     sentences: NA
#>     smooth_idf: TRUE
#>     split:  
#>     transform: function (sentences) 
#>   Private:
#>     check_args: function (x, max_value, what) 
#>     get_bow_df: function (sentences, use_tokens = NULL) 
#>     get_tokens: function (sentences, min_df = 1, max_df = 1, ngram_range = NULL, 
#>     gettfmatrix: function (countmatrix, sentences, norm, smooth_idf = TRUE) 
#>     preprocess: function (sentences, regex = "[^0-9a-zA-Z ]", lowercase, remove_stopwords) 

## ------------------------------------------------
## Method `TfIdfVectorizer$fit`
## ------------------------------------------------

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)

## ------------------------------------------------
## Method `TfIdfVectorizer$fit_transform`
## ------------------------------------------------

if (FALSE) {
sents <- c('i am alone in dark.','mother_mary a lot',
         'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
}

## ------------------------------------------------
## Method `TfIdfVectorizer$transform`
## ------------------------------------------------

if (FALSE) {
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
tf = TfIdfVectorizer$new(min_df=0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
}