removeNgrams

Remove n-grams from bag-of-n-grams model

Description

example

newBag = removeNgrams(bag,ngrams) removes the specified n-grams from the bag-of-n-grams model bag.

example

newBag = removeNgrams(bag,idx) specifies n-grams by numeric or logical indices in bag.Ngrams. This syntax is the same as newBag = removeNgrams(bag,bag.Ngrams(idx,:)).

Examples

collapse all

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create bag-of-n-grams model.

bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

View the top five n-grams.

topkngrams(bag,5)
ans=5×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2     
    "mine"    "eye"      15           2     
    "thy"     "self"     14           2     
    "thou"    "dost"     13           2     
    "mine"    "own"      13           2     

Remove the n-grams ["thou" "art"] and ["thou" "dost"] from the model. View the new top 5 n-grams.

ngrams = [...
    "thou" "art"
    "thou" "dost"];
bag = removeNgrams(bag,ngrams);
topkngrams(bag,5)
ans=5×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "mine"    "eye"       15           2     
    "thy"     "self"      14           2     
    "mine"    "own"       13           2     
    "thy"     "sweet"     12           2     
    "thy"     "love"      11           2     

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create bag-of-n-grams model.

bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [154x8799 double]
      Vocabulary: [1x3092 string]
          Ngrams: [8799x2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

View the first ten n-grams in the model.

bag.Ngrams(1:10,:)
ans = 10x2 string array
    "fairest"      "creatures"
    "creatures"    "desire"   
    "desire"       "increase" 
    "increase"     "thereby"  
    "thereby"      "beautys"  
    "beautys"      "rose"     
    "rose"         "might"    
    "might"        "never"    
    "never"        "die"      
    "die"          "riper"    

Remove the 9th and 10th n-grams from the model. View the new list of the first ten n-grams.

idx = [9 10];
bag = removeNgrams(bag,idx);
bag.Ngrams(1:10,:)
ans = 10x2 string array
    "fairest"      "creatures"
    "creatures"    "desire"   
    "desire"       "increase" 
    "increase"     "thereby"  
    "thereby"      "beautys"  
    "beautys"      "rose"     
    "rose"         "might"    
    "might"        "never"    
    "riper"        "time"     
    "time"         "decease"  

Input Arguments

collapse all

Input bag-of-n-grams model, specified as a bagOfNgrams object.

N-grams to remove, specified as a string array, character vector, or a cell array of character vectors.

If ngrams is a string array or cell array, then it has size NumNgrams-by-maxN , where NumNgrams is the number of n-grams, and maxN is the length of the largest n-gram. If ngrams is a character vector, then it represents a single word (unigram).

The value of ngrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of ngrams are empty.

Example: ["An" ""; "An example"; "example" ""]

Data Types: string | char | cell

Indices of n-grams to remove, specified as a vector of numeric indices or a vector of logical indices. The indices in idx correspond to the rows of the bag.Ngrams.

Example: [1 5 10]

Introduced in R2018a