bagOfNgrams

Bag-of-n-grams model

Description

A bag-of-n-grams model records the number of times that each n-gram appears in each document of a collection. An n-gram is a sequence of n consecutive words.

bagOfNgrams does not split text into words. To create an array of tokenized documents, see tokenizedDocument.
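As a small illustration of the counting described above, the following sketch builds a model from one short sentence. The sentence is made up for this illustration and is not part of the example data.

```matlab
% One short document. tokenizedDocument splits the text into words.
documents = tokenizedDocument("an example of a short sentence");

% With the default n-gram length of 2, the model counts bigrams.
bag = bagOfNgrams(documents);

% Each row of Ngrams holds one bigram, for example "an" "example".
bag.Ngrams

% Six words yield five consecutive word pairs.
bag.NumNgrams
```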

Creation

Syntax

bag = bagOfNgrams

bag = bagOfNgrams(documents)

bag = bagOfNgrams(___,'NgramLengths',lengths)

bag = bagOfNgrams(uniqueNgrams,counts)

Description

bag = bagOfNgrams creates an empty bag-of-n-grams model.

bag = bagOfNgrams(documents) creates a bag-of-n-grams model and counts the bigrams (pairs of words) in documents.

bag = bagOfNgrams(___,'NgramLengths',lengths) counts n-grams of the specified lengths using any of the previous syntaxes.

bag = bagOfNgrams(uniqueNgrams,counts) creates a bag-of-n-grams model using the n-grams in uniqueNgrams and the corresponding frequency counts in counts. If uniqueNgrams contains <missing> values, then the corresponding values in counts are ignored.

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
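The two input forms can be sketched as follows. The text is invented for illustration, and the variable names bagSingle and bagMulti are arbitrary.

```matlab
% A single document given as a row vector of words (1-by-4 string array).
words = ["an" "example" "of" "text"];
bagSingle = bagOfNgrams(words);

% Multiple documents must be passed as a tokenizedDocument array.
documents = tokenizedDocument([
    "an example of text"
    "another example of text"]);
bagMulti = bagOfNgrams(documents);

% Compare the number of documents recorded by each model.
[bagSingle.NumDocuments bagMulti.NumDocuments]
```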

`uniqueNgrams` — Unique n-gram list
string array | cell array of character vectors

Unique n-gram list, specified as a NumNgrams-by-maxN string array or cell array of character vectors, where NumNgrams is the number of unique n-grams, and maxN is the number of words in the longest n-gram.

The value of uniqueNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of uniqueNgrams are empty.

If uniqueNgrams contains <missing>, then the function ignores the corresponding values in counts.

Each n-gram must have at least one word.

Example: ["An" ""; "An" "example"; "example" ""]

Data Types: string | cell

`counts` — Frequency counts of n-grams
matrix of nonnegative integers

Frequency counts of n-grams corresponding to the rows of uniqueNgrams, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the n-gram uniqueNgrams(j,:) appears in the ith document.

counts must have as many columns as uniqueNgrams has rows.
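A minimal sketch of this syntax, reusing the example n-gram list above with a made-up counts matrix for two documents. The count values are arbitrary and serve only to show the expected shapes.

```matlab
% Three unique n-grams: two unigrams (padded with "") and one bigram.
uniqueNgrams = ["An" ""; "An" "example"; "example" ""];

% One row per document, one column per row of uniqueNgrams.
counts = [1 1 1;
          2 0 1];

bag = bagOfNgrams(uniqueNgrams,counts);

% counts has 2 rows and 3 columns, so the model holds 2 documents
% and 3 n-grams.
[bag.NumDocuments bag.NumNgrams]
```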

`lengths` — Lengths of n-grams
2 (default) | positive integer | vector of positive integers

Lengths of n-grams, specified as a positive integer or a vector of positive integers.

Properties

`Counts` — N-gram counts per document
sparse matrix

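N-gram counts per document, specified as a NumDocuments-by-NumNgrams sparse matrix. Counts(i,j) is the number of times the jth n-gram of the model appears in the ith document.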
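Because Counts is sparse, summaries are often easier to read after converting with full. The sketch below uses two invented documents and sums each column to get the total frequency of every n-gram across the collection.

```matlab
documents = tokenizedDocument([
    "a b a b"
    "a b c"]);
bag = bagOfNgrams(documents);

% Counts has one row per document and one column per unique bigram.
size(bag.Counts)

% Total occurrences of each bigram over all documents, as a full row vector.
totals = full(sum(bag.Counts,1))
```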

`Ngrams` — Unique n-grams in model
string array

Unique n-grams in the model, specified as a string array. Ngrams(i,j) is the jth word of the ith n-gram. If the number of columns of Ngrams is greater than the number of words in the n-gram, then the remaining entries are empty.
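The padding can be used to pick out n-grams of a given length. The following sketch assumes that the padded entries compare equal to the empty string "", and the sentence is invented for illustration.

```matlab
documents = tokenizedDocument("the quick brown fox");
bag = bagOfNgrams(documents,'NgramLengths',[1 2]);

% Rows that hold a unigram have an empty second column.
isUnigram = bag.Ngrams(:,2) == "";

% Keep only the first column of the unigram rows.
unigrams = bag.Ngrams(isUnigram,1)

% The remaining rows are the bigrams.
bigrams = bag.Ngrams(~isUnigram,:)
```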

`NgramLengths` — Lengths of n-grams
2 (default) | positive integer | vector of positive integers

Lengths of n-grams, specified as a positive integer or a vector of positive integers.

`Vocabulary` — Unique words in model
string vector

Unique words in the model, specified as a string vector.

Data Types: string

`NumNgrams` — Number of n-grams seen
nonnegative integer

Number of n-grams seen, specified as a nonnegative integer.

`NumDocuments` — Number of documents seen
nonnegative integer

Number of documents seen, specified as a nonnegative integer.

Object Functions

`encode`	Encode documents as matrix of word or n-gram counts
`tfidf`	Term Frequency–Inverse Document Frequency (tf-idf) matrix
`topkngrams`	Most frequent n-grams
`addDocument`	Add documents to bag-of-words or bag-of-n-grams model
`removeDocument`	Remove documents from bag-of-words or bag-of-n-grams model
`removeEmptyDocuments`	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
`removeNgrams`	Remove n-grams from bag-of-n-grams model
`removeInfrequentNgrams`	Remove infrequently seen n-grams from bag-of-n-grams model
`join`	Combine multiple bag-of-words or bag-of-n-grams models
`wordcloud`	Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
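A short sketch of how a few of these functions fit together, using two invented documents. The variable names tbl, newCounts, and smallBag are arbitrary.

```matlab
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfNgrams(documents);

% Most frequent bigram, returned as a table with Ngram, Count,
% and NgramLength variables.
tbl = topkngrams(bag,1)

% Encode an unseen document as a row of counts over the model's n-grams.
% N-grams that are not in the model are not counted.
newDocument = tokenizedDocument("one more short sentence");
newCounts = encode(bag,newDocument);

% Remove the bigrams that appear at most once in total.
smallBag = removeInfrequentNgrams(bag,1);
```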

Examples

Create Bag-of-N-Grams Model

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
documents(1:10)

ans = 
  10×1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
    68 tokens: let winters ragged hand deface thee thy summer ere thou distilld make sweet vial treasure thou place beautys treasure ere selfkilld forbidden usury happies pay willing loan thats thy self breed another thee ten times happier ten ten times thy self happier thou art ten thine ten times refigurd thee death thou shouldst depart leaving thee living posterity selfwilld thou art fair deaths conquest make worms thine heir
    64 tokens: lo orient gracious light lifts up burning head eye doth homage newappearing sight serving looks sacred majesty climbd steepup heavenly hill resembling strong youth middle age yet mortal looks adore beauty still attending golden pilgrimage highmost pitch weary car like feeble age reeleth day eyes fore duteous converted low tract look another way thou thyself outgoing thy noon unlookd diest unless thou get son
    70 tokens: music hear why hearst thou music sadly sweets sweets war joy delights joy why lovst thou thou receivst gladly else receivst pleasure thine annoy true concord welltuned sounds unions married offend thine ear sweetly chide thee confounds singleness parts thou shouldst bear mark string sweet husband another strikes mutual ordering resembling sire child happy mother pleasing note sing whose speechless song many seeming sings thee thou single wilt prove none
    70 tokens: fear wet widows eye thou consumst thy self single life ah thou issueless shalt hap die world wail thee like makeless wife world thy widow still weep thou form thee hast left behind every private widow well keep childrens eyes husbands shape mind look unthrift world doth spend shifts place still world enjoys beautys waste hath world end kept unused user destroys love toward others bosom sits murdrous shame commits
    69 tokens: shame deny thou bearst love thy self art unprovident grant thou wilt thou art belovd many thou none lovst evident thou art possessd murderous hate gainst thy self thou stickst conspire seeking beauteous roof ruinate repair thy chief desire o change thy thought change mind shall hate fairer lodgd gentle love thy presence gracious kind thyself least kindhearted prove make thee another self love beauty still live thine thee

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

Visualize the model using a word cloud.

figure 
wordcloud(bag);

Figure contains an object of type wordcloud.

Count N-Grams of Different Lengths

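Load the example data. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.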

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].

bag = bagOfNgrams(documents,'NgramLengths',[2 3])

bag = 
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

View the 10 most common n-grams of length 2 (bigrams).

topkngrams(bag,10,'NgramLengths',2)

ans=10×3 table
             Ngram             Count    NgramLength
    _______________________    _____    ___________

    "thou"    "art"      ""     34           2     
    "mine"    "eye"      ""     15           2     
    "thy"     "self"     ""     14           2     
    "thou"    "dost"     ""     13           2     
    "mine"    "own"      ""     13           2     
    "thy"     "sweet"    ""     12           2     
    "thy"     "love"     ""     11           2     
    "dost"    "thou"     ""     10           2     
    "thou"    "wilt"     ""     10           2     
    "love"    "thee"     ""      9           2

View the 10 most common n-grams of length 3 (trigrams).

topkngrams(bag,10,'NgramLengths',3)

ans=10×3 table
               Ngram                Count    NgramLength
    ____________________________    _____    ___________

    "thy"     "sweet"    "self"       4           3     
    "why"     "dost"     "thou"       4           3     
    "thy"     "self"     "thy"        3           3     
    "thou"    "thy"      "self"       3           3     
    "mine"    "eye"      "heart"      3           3     
    "thou"    "shalt"    "find"       3           3     
    "fair"    "kind"     "true"       3           3     
    "thou"    "art"      "fair"       2           3     
    "love"    "thy"      "self"       2           3     
    "thy"     "self"     "thou"       2           3

Create Bag-of-N-Grams Model from Unique N-Grams and Counts

Create a bag-of-n-grams model using a string array of unique n-grams and a matrix of counts.

Load the example n-grams and counts from sonnetsBigramCounts.mat. This file contains a string array uniqueNgrams, which contains the unique n-grams, and the matrix counts, which contains the n-gram frequency counts.

load sonnetsBigramCounts.mat

View the first few n-grams in uniqueNgrams.

uniqueNgrams(1:10,:)

ans = 10×2 string
    "fairest"      "creatures"
    "creatures"    "desire"   
    "desire"       "increase" 
    "increase"     "thereby"  
    "thereby"      "beautys"  
    "beautys"      "rose"     
    "rose"         "might"    
    "might"        "never"    
    "never"        "die"      
    "die"          "riper"

Create the bag-of-n-grams model.

bag = bagOfNgrams(uniqueNgrams,counts)

bag = 
  bagOfNgrams with properties:

          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
          Counts: [154×8799 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    …    ] (1×3092 string)
    NumDocuments: 154

Version History

Introduced in R2018a
