topkwords

Most important words in bag-of-words model or LDA topic

Description


tbl = topkwords(bag) returns a table of the five words with the largest word counts in bag-of-words model bag.


tbl = topkwords(bag,k) returns a table of the k words with the largest word counts.


tbl = topkwords(ldaMdl,k,topicIdx) returns a table of the k words with the highest probabilities in the latent Dirichlet allocation (LDA) topic topicIdx in the LDA model ldaMdl.


tbl = topkwords(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples


Create a table of the most frequent words of a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Find the top five words.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)
T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

Create a table of the words with highest probability of an LDA topic.

To reproduce the results, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);

Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.

numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0);

Find the top 20 words of the first topic.

k = 20;
topicIdx = 1;
tbl = topkwords(mdl,k,topicIdx)
tbl=20×2 table
      Word        Score  
    ________    _________

    "eyes"        0.11155
    "beauty"      0.05777
    "hath"       0.055778
    "still"      0.049801
    "true"       0.043825
    "mine"       0.033865
    "find"       0.031873
    "black"      0.025897
    "look"       0.023905
    "tis"        0.023905
    "kind"       0.021913
    "seen"       0.021913
    "found"      0.017929
    "sin"        0.015937
    "three"      0.013945
    "golden"    0.0099608
      ⋮

Find the top 20 words of the first topic and use inverse mean scaling on the scores.

tbl = topkwords(mdl,k,topicIdx,'Scaling','inversemean')
tbl=20×2 table
      Word       Score  
    ________    ________

    "eyes"        1.2718
    "beauty"     0.59022
    "hath"        0.5692
    "still"      0.50269
    "true"       0.43719
    "mine"       0.32764
    "find"       0.32544
    "black"      0.25931
    "tis"        0.23755
    "look"       0.22519
    "kind"       0.21594
    "seen"       0.21594
    "found"      0.17326
    "sin"        0.15223
    "three"      0.13143
    "golden"    0.090698
      ⋮

Create a word cloud using the scaled scores as the size data.

figure
wordcloud(tbl.Word,tbl.Score);

Input Arguments


bag – Input bag-of-words model, specified as a bagOfWords object.

k – Number of words to return, specified as a positive integer.

Example: 20

ldaMdl – Input LDA model, specified as an ldaModel object.

topicIdx – Index of LDA topic, specified as a positive integer.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Scaling','inversemean' specifies to use inverse mean scaling on the topic word probabilities.

Indicator for forcing the output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

This option only applies if the input data is a bag-of-words model.

Data Types: logical
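For example, a short sketch of the effect of this option, assuming the sonnets bag-of-words model bag created in the examples above:

```matlab
% With a scalar bag-of-words model, 'ForceCellOutput' wraps the result
% table in a 1-by-1 cell array, matching the output format you get for
% a non-scalar array of models.
tbl = topkwords(bag,5,'ForceCellOutput',true);
class(tbl)   % 'cell'
tbl{1}       % table of the five most frequent words
```

Forcing cell output is useful when downstream code must handle scalar and non-scalar model arrays uniformly.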

Scaling to apply to topic word probabilities, specified as the comma-separated pair consisting of 'Scaling' and one of the following:

  • 'none' – Return posterior word probabilities.

  • 'inversemean' – Normalize the posterior word probabilities per topic by the geometric mean of the posterior probabilities for this word across all topics. The function uses the formula Phi.*(log(Phi)-mean(log(Phi),1)), where Phi corresponds to ldaMdl.TopicWordProbabilities.

This option only applies if the input data is an LDA model.

Example: 'Scaling','inversemean'

Data Types: char
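The 'inversemean' formula can be illustrated with base MATLAB on a small hypothetical matrix standing in for ldaMdl.TopicWordProbabilities (rows correspond to topics, columns to words):

```matlab
% Hypothetical 3-topic, 4-word probability matrix; each row sums to 1.
Phi = [0.50 0.20 0.20 0.10;
       0.10 0.60 0.20 0.10;
       0.25 0.25 0.25 0.25];

% 'inversemean' scaling: subtract each word's mean log probability
% across topics (the log of its geometric mean), then weight by Phi.
scores = Phi .* (log(Phi) - mean(log(Phi),1));
```

Words that are probable in one topic but rare in the others receive high scores in that topic, while words that are equally probable everywhere score near zero, which is why this scaling tends to surface topic-distinctive words.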

Output Arguments


Table of top words sorted in order of importance, or a cell array of such tables.

When the input is a bag-of-words model, the table has the following columns:

Word – Word, specified as a string
Count – Number of times the word appears in the bag-of-words model

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top words of the corresponding element of bag.

When the input is an LDA model, the table has the following columns:

Word – Word, specified as a string
Score – Word probability for the given LDA topic

Tips

  • To find the most frequently seen n-grams in a bag-of-n-grams model, use topkngrams.
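A brief sketch of this workflow, assuming the tokenized sonnets documents from the examples above:

```matlab
% Create a bag-of-n-grams model (bigrams by default) and list the
% ten most frequent bigrams by count.
bagNg = bagOfNgrams(documents);
tblNg = topkngrams(bagNg,10);
```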

Introduced in R2017b