topkwords

Most important words in bag-of-words model or LDA topic

Description


tbl = topkwords(bag) returns a table of the five words with the largest word counts in bag-of-words model bag.


tbl = topkwords(bag,k) returns a table of the k words with the largest word counts.


tbl = topkwords(ldaMdl,k,topicIdx) returns a table of the k words with the highest probabilities in the latent Dirichlet allocation (LDA) topic topicIdx in the LDA model ldaMdl.


tbl = topkwords(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples


Create a table of the most frequent words of a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Find the top five words.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)
T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

Create a table of the words with highest probability of an LDA topic.

To reproduce the results, set rng to 'default'.

rng('default')

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);

Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.

numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0);

Find the top 20 words of the first topic.

k = 20;
topicIdx = 1;
tbl = topkwords(mdl,k,topicIdx)
tbl=20×2 table
      Word        Score  
    ________    _________

    "eyes"        0.11155
    "beauty"      0.05777
    "hath"       0.055778
    "still"      0.049801
    "true"       0.043825
    "mine"       0.033865
    "find"       0.031873
    "black"      0.025897
    "look"       0.023905
    "tis"        0.023905
    "kind"       0.021913
    "seen"       0.021913
    "found"      0.017929
    "sin"        0.015937
    "three"      0.013945
    "golden"    0.0099608
      ⋮

Find the top 20 words of the first topic and use inverse mean scaling on the scores.

tbl = topkwords(mdl,k,topicIdx,'Scaling','inversemean')
tbl=20×2 table
      Word       Score  
    ________    ________

    "eyes"        1.2718
    "beauty"     0.59022
    "hath"        0.5692
    "still"      0.50269
    "true"       0.43719
    "mine"       0.32764
    "find"       0.32544
    "black"      0.25931
    "tis"        0.23755
    "look"       0.22519
    "kind"       0.21594
    "seen"       0.21594
    "found"      0.17326
    "sin"        0.15223
    "three"      0.13143
    "golden"    0.090698
      ⋮

Create a word cloud using the scaled scores as the size data.

figure
wordcloud(tbl.Word,tbl.Score);

Input Arguments


bag – Input bag-of-words model, specified as a bagOfWords object.

k – Number of words to return, specified as a positive integer.

Example: 20

ldaMdl – Input LDA model, specified as an ldaModel object.

topicIdx – Index of LDA topic, specified as a positive integer.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Scaling','inversemean' specifies to use inverse mean scaling on the topic word probabilities.

Indicator for forcing the output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

This option only applies if the input data is a bag-of-words model.

Data Types: logical
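For example, a short sketch of the effect of this option, assuming the sonnets bag-of-words model bag created in the examples above:

```matlab
% With a scalar bag-of-words model, 'ForceCellOutput' wraps the result
% table in a 1-by-1 cell array, matching the output format you get for
% a non-scalar array of models.
tbl = topkwords(bag,5,'ForceCellOutput',true);
class(tbl)   % 'cell'
tbl{1}       % table of the five most frequent words
```

Forcing cell output is useful when downstream code must handle scalar and non-scalar model arrays uniformly.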

Scaling to apply to topic word probabilities, specified as the comma-separated pair consisting of 'Scaling' and one of the following:

  • 'none' – Return posterior word probabilities.

  • 'inversemean' – Normalize the posterior word probabilities per topic by the geometric mean of the posterior probabilities for this word across all topics. The function uses the formula Phi.*(log(Phi)-mean(log(Phi),1)), where Phi corresponds to ldaMdl.TopicWordProbabilities.

This option only applies if the input data is an LDA model.

Example: 'Scaling','inversemean'

Data Types: char
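The 'inversemean' formula can be illustrated with base MATLAB on a small hypothetical matrix standing in for ldaMdl.TopicWordProbabilities (rows correspond to topics, columns to words):

```matlab
% Hypothetical 3-topic, 4-word probability matrix; each row sums to 1.
Phi = [0.50 0.20 0.20 0.10;
       0.10 0.60 0.20 0.10;
       0.25 0.25 0.25 0.25];

% 'inversemean' scaling: subtract each word's mean log probability
% across topics (the log of its geometric mean), then weight by Phi.
scores = Phi .* (log(Phi) - mean(log(Phi),1));
```

Words that are probable in one topic but rare in the others receive high scores in that topic, while words that are equally probable everywhere score near zero, which is why this scaling tends to surface topic-distinctive words.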

Output Arguments


Table of top words sorted in order of importance, or a cell array of such tables.

When the input is a bag-of-words model, the table has the following columns:

Word – Word, specified as a string
Count – Number of times the word appears in the bag-of-words model

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top words of the corresponding element of bag.

When the input is an LDA model, the table has the following columns:

Word – Word, specified as a string
Score – Word probability for the given LDA topic

Tips

  • To find the most frequently seen n-grams in a bag-of-n-grams model, use topkngrams.
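A brief sketch of this workflow, assuming the tokenized sonnets documents from the examples above:

```matlab
% Create a bag-of-n-grams model (bigrams by default) and list the
% ten most frequent bigrams by count.
bagNg = bagOfNgrams(documents);
tblNg = topkngrams(bagNg,10);
```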

Introduced in R2017b