tfidf

Term Frequency–Inverse Document Frequency (tf-idf) matrix

Description

example

M = tfidf(bag) returns a Term Frequency-Inverse Document Frequency (tf-idf) matrix based on the bag-of-words or bag-of-n-grams model bag.

example

M = tfidf(bag,documents) returns a tf-idf matrix for the documents in documents by using the inverse document frequency (IDF) factor computed from bag.

example

M = tfidf(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

collapse all

Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

Create a Term Frequency-Inverse Document Frequency (tf-idf) matrix from a bag-of-words model and an array of new documents.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model from the documents.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix for an array of new documents using the inverse document frequency (IDF) factor computed from bag.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
M = tfidf(bag,newDocuments)
M = 
   (1,7)       3.2452
   (1,36)      1.2303
   (2,197)     3.4275
   (2,313)     3.6507
   (2,387)     0.6061
   (1,1205)    4.7958
   (1,1835)    3.6507
   (2,1917)    5.0370

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

You can change the contributions made by the TF and IDF factors to the tf-idf matrix by specifying the TF and IDF weight formulas.

To ignore how many times a word appears in a document, use the binary option of 'TFWeight'. Create a tf-idf matrix and set 'TFWeight' to 'binary'. View the first 10 rows and columns.

M = tfidf(bag,'TFWeight','binary');
full(M(1:10,1:10))
ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    1.9459    2.4720    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

Input Arguments

collapse all

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a string array or a cell array of character vectors, then it must be a row vector representing a single document, where each element is a word.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Normalized',true specifies to normalize the frequency counts.

Method to set term frequency (TF) factor, specified as the comma-separated pair consisting of 'TFWeight' and one of the following:

  • 'raw' – Set the TF factor to the unchanged term counts.

  • 'binary' – Set the TF factor to the matrix of ones and zeros where the ones indicate whether a term is in a document.

  • 'log' – Set the TF factor to 1 + log(bag.Counts).

Example: 'TFWeight','binary'

Data Types: char

Method to set inverse document frequency (IDF) factor, specified as the comma-separated pair consisting of 'IDFWeight' and one of the following:

  • 'normal' – Set the IDF factor to log(N/NT).

  • 'unary' – Set the IDF factor to 1.

  • 'smooth' – Set the IDF factor to log(1+N/NT).

  • 'max' – Set the IDF factor to log(1+max(NT)/NT).

  • 'probabilistic' – Set the IDF factor to log((N-NT)/NT).

where N is the number of documents in the bag, and NT is the number of documents containing each term which is equivalent to sum(bag.Counts).

Example: 'IDFWeight','smooth'

Data Types: char

Option to normalize term counts, specified as the comma-separated pair consisting of 'Normalized' and true or false. If true, then the function normalizes each vector of term counts in the Euclidean norm.

Example: 'Normalized',true

Data Types: logical

Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

  • 'rows' – Return a matrix of frequency counts with rows corresponding to documents.

  • 'columns' – Return a transposed matrix of frequency counts with columns corresponding to documents.

Data Types: char

Indicator for forcing output to be returned as cell array, specified as the comma separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical

Output Arguments

collapse all

Output Term Frequency-Inverse Document Frequency matrix, specified as a sparse matrix or a cell array of sparse matrices.

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is the tf-idf matrix calculated from the corresponding element of bag.

Introduced in R2017b