removeNgrams

Remove n-grams from bag-of-n-grams model

Syntax

newBag = removeNgrams(bag,ngrams)

newBag = removeNgrams(bag,ngrams,'IgnoreCase',true)

newBag = removeNgrams(bag,idx)

Description

newBag = removeNgrams(bag,ngrams) removes the specified n-grams from the bag-of-n-grams model bag. By default, the function is case sensitive.

newBag = removeNgrams(bag,ngrams,'IgnoreCase',true) removes n-grams ignoring case.
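
As a minimal sketch of the difference between the two syntaxes, the following code builds a small model from two made-up documents (illustrative data, not part of the shipped examples) whose first bigram differs only in letter case, and then compares case-sensitive and case-insensitive removal. The variable names are assumptions chosen for illustration.

```matlab
% Two short documents whose first bigram differs only by letter case.
documents = tokenizedDocument([
    "An example of a short sentence"
    "an example of another sentence"]);
bag = bagOfNgrams(documents);

% Case-sensitive removal (default): only the lowercase bigram matches.
bagSensitive = removeNgrams(bag,["an" "example"]);

% Case-insensitive removal: "An example" and "an example" both match.
bagInsensitive = removeNgrams(bag,["an" "example"],'IgnoreCase',true);

% Compare how many n-grams each call removed.
numRemovedSensitive = bag.NumNgrams - bagSensitive.NumNgrams
numRemovedInsensitive = bag.NumNgrams - bagInsensitive.NumNgrams
```

With this data, the case-insensitive call should remove one more n-gram than the default call, because the model stores "An example" and "an example" as separate n-grams.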

newBag = removeNgrams(bag,idx) specifies n-grams by numeric or logical indices in bag.Ngrams. This syntax is the same as newBag = removeNgrams(bag,bag.Ngrams(idx,:)).

Examples

Remove N-Grams from Bag-of-N-Grams Model

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

View the top five n-grams.

topkngrams(bag,5)

ans=5×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "thou"    "art"       34           2
    "mine"    "eye"       15           2
    "thy"     "self"      14           2
    "thou"    "dost"      13           2
    "mine"    "own"       13           2

Remove the n-grams ["thou" "art"] and ["thou" "dost"] from the model. View the new top five n-grams.

ngrams = [...
    "thou" "art"
    "thou" "dost"];
bag = removeNgrams(bag,ngrams);
topkngrams(bag,5)

ans=5×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "mine"    "eye"       15           2
    "thy"     "self"      14           2
    "mine"    "own"       13           2
    "thy"     "sweet"     12           2
    "thy"     "love"      11           2

Remove N-Grams from Bag-of-N-Grams Model by Index

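Load the example data. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.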

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
          Counts: [154×8799 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

View the first ten n-grams in the model.

bag.Ngrams(1:10,:)

ans = 10×2 string
    "fairest"      "creatures"
    "creatures"    "desire"   
    "desire"       "increase" 
    "increase"     "thereby"  
    "thereby"      "beautys"  
    "beautys"      "rose"     
    "rose"         "might"    
    "might"        "never"    
    "never"        "die"      
    "die"          "riper"

Remove the 9th and 10th n-grams from the model. View the new list of the first ten n-grams.

idx = [9 10];
bag = removeNgrams(bag,idx);
bag.Ngrams(1:10,:)

ans = 10×2 string
    "fairest"      "creatures"
    "creatures"    "desire"   
    "desire"       "increase" 
    "increase"     "thereby"  
    "thereby"      "beautys"  
    "beautys"      "rose"     
    "rose"         "might"    
    "might"        "never"    
    "riper"        "time"     
    "time"         "decease"

Input Arguments

`bag` — Input bag-of-n-grams model
`bagOfNgrams` object

Input bag-of-n-grams model, specified as a bagOfNgrams object.

`ngrams` — N-grams to remove
string array | character vector | cell array of character vectors

N-grams to remove, specified as a string array, character vector, or a cell array of character vectors.

If ngrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams and maxN is the length of the longest n-gram. If ngrams is a character vector, then it represents a single word (unigram).

The value of ngrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of ngrams are empty.

Example: ["An" ""; "An" "example"; "example" ""]

Data Types: string | char | cell
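
To illustrate the padded layout described above, the following sketch removes two unigrams and one bigram in a single call. It assumes a model that counts both unigrams and bigrams, created with the 'NgramLengths' option of bagOfNgrams. The document text is made up for illustration.

```matlab
% Build a model that counts both unigrams and bigrams.
documents = tokenizedDocument("An example of a short sentence");
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Each row is one n-gram. Unigrams are padded with "" up to maxN = 2.
ngrams = [
    "An"      ""
    "An"      "example"
    "example" ""];

newBag = removeNgrams(bag,ngrams);

% Number of n-grams removed from the model.
numRemoved = bag.NumNgrams - newBag.NumNgrams
```

The empty strings are padding only. They let n-grams of different lengths share one rectangular string array.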

`idx` — Indices of n-grams to remove
vector of numeric indices | vector of logical indices

Indices of n-grams to remove, specified as a vector of numeric indices or a vector of logical indices. The indices in idx correspond to the rows of bag.Ngrams.

Example: [1 5 10]
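
Logical indices are convenient when the n-grams to remove are selected by a condition instead of by position. As a rough sketch, the following code removes every n-gram that contains the word "thou". The documents and variable names are made up for illustration.

```matlab
% Small illustrative corpus.
documents = tokenizedDocument([
    "thou art fair"
    "mine eye sees thee"
    "thou dost shine"]);
bag = bagOfNgrams(documents);

% Logical index: true for each row of bag.Ngrams that contains "thou".
idx = any(bag.Ngrams == "thou",2);
newBag = removeNgrams(bag,idx);

% The equivalent numeric indices select the same rows.
newBagNumeric = removeNgrams(bag,find(idx));
tf = isequal(newBag.Ngrams,newBagNumeric.Ngrams)
```

Here idx has one element per row of bag.Ngrams, so the logical and numeric forms should produce the same model.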

Version History

Introduced in R2018a
