topkwords
Most important words in bag-of-words model or LDA topic
Syntax
Description
tbl = topkwords(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Most Frequent Words of Bag-of-Words Model
Create a table of the most frequent words of a bag-of-words model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Find the top five words.
T = topkwords(bag);
Find the top 20 words in the model.
k = 20;
T = topkwords(bag,k)
T=20×2 table
Word Count
________ _____
"thy" 281
"thou" 234
"love" 162
"thee" 161
"doth" 88
"mine" 63
"shall" 59
"eyes" 56
"sweet" 55
"time" 53
"beauty" 52
"nor" 52
"art" 51
"yet" 51
"o" 50
"heart" 50
⋮
Highest Probability Words of LDA Topic
Create a table of the words with highest probability of an LDA topic.
To reproduce the results, set rng to 'default'.
rng('default')
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents);
Fit an LDA model with 20 topics. To suppress verbose output, set 'Verbose' to 0.
numTopics = 20;
mdl = fitlda(bag,numTopics,'Verbose',0);
Find the top 20 words of the first topic.
k = 20;
topicIdx = 1;
tbl = topkwords(mdl,k,topicIdx)
tbl=20×2 table
Word Score
________ _________
"eyes" 0.11155
"beauty" 0.05777
"hath" 0.055778
"still" 0.049801
"true" 0.043825
"mine" 0.033865
"find" 0.031873
"black" 0.025897
"look" 0.023905
"tis" 0.023905
"kind" 0.021913
"seen" 0.021913
"found" 0.017929
"sin" 0.015937
"three" 0.013945
"golden" 0.0099608
⋮
Find the top 20 words of the first topic and use inverse mean scaling on the scores.
tbl = topkwords(mdl,k,topicIdx,'Scaling','inversemean')
tbl=20×2 table
Word Score
________ ________
"eyes" 1.2718
"beauty" 0.59022
"hath" 0.5692
"still" 0.50269
"true" 0.43719
"mine" 0.32764
"find" 0.32544
"black" 0.25931
"tis" 0.23755
"look" 0.22519
"kind" 0.21594
"seen" 0.21594
"found" 0.17326
"sin" 0.15223
"three" 0.13143
"golden" 0.090698
⋮
Create a word cloud using the scaled scores as the size data.
figure
wordcloud(tbl.Word,tbl.Score);
Input Arguments
bag — Input bag-of-words model
bagOfWords object
Input bag-of-words model, specified as a bagOfWords object.
k — Number of words
positive integer | Inf
Number of words to return, specified as a positive integer or Inf.
If k is Inf, then the function returns all words. For bag-of-words and LDA model input, the function sorts the words in order of frequency and importance, respectively.
Example: 20
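As a minimal sketch of the k argument, the following uses a small hypothetical document (not the sonnets data) and assumes Text Analytics Toolbox is available. It requests the top two words and then all words by passing Inf.

```matlab
% Hypothetical toy document, used only for illustration.
documents = tokenizedDocument("tea tea tea milk milk sugar");
bag = bagOfWords(documents);

% Return the two most frequent words.
T2 = topkwords(bag,2);

% Return every word in the vocabulary, sorted by frequency.
Tall = topkwords(bag,Inf);
```

Because Inf returns all words, the number of rows in Tall should match bag.NumWords.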
ldaMdl — Input LDA model
ldaModel object
Input LDA model, specified as an ldaModel object.
topicIdx — Index of LDA topic
nonnegative integer
Index of LDA topic, specified as a nonnegative integer.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Scaling','inversemean' specifies to use inverse mean scaling on the topic word probabilities.
IgnoreCase — Option to ignore case
false (default) | true
Option to ignore case, specified as the comma-separated pair consisting of 'IgnoreCase' and one of the following:
false – Treat words differing only by case as separate words.
true – Treat words differing only by case as the same word and merge counts.
This option supports bag-of-words input only.
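The effect of 'IgnoreCase' can be sketched with a hypothetical toy document in which two tokens differ only by case. This assumes Text Analytics Toolbox is available; the document text is invented for illustration.

```matlab
% Hypothetical toy document: "Love" and "love" differ only by case.
documents = tokenizedDocument("Love love love time");
bag = bagOfWords(documents);

% Default behavior: "Love" and "love" are counted as separate words.
Tsep = topkwords(bag,Inf);

% Merge the counts of words that differ only by case.
Tmerged = topkwords(bag,Inf,'IgnoreCase',true);
```

With the default setting the table has one row per distinct token, whereas with 'IgnoreCase' set to true the two case variants share a single row with their combined count.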
ForceCellOutput — Indicator for forcing output to be returned as cell array
false (default) | true
Indicator for forcing output to be returned as a cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.
This option supports bag-of-words input only.
Data Types: logical
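A short sketch of 'ForceCellOutput', using a hypothetical toy document and assuming Text Analytics Toolbox is available. Forcing a cell array can be convenient when the same code must handle both scalar and non-scalar bag input.

```matlab
% Hypothetical toy document, used only for illustration.
documents = tokenizedDocument("tea tea tea milk milk sugar");
bag = bagOfWords(documents);

% With a scalar bag, the default output is a table.
T = topkwords(bag,2);

% Force the output to be a cell array containing that table.
C = topkwords(bag,2,'ForceCellOutput',true);
firstTable = C{1};
```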
Scaling — Scaling to apply to topic word probabilities
'none' (default) | 'inversemean'
Scaling to apply to topic word probabilities, specified as the comma-separated pair consisting of 'Scaling' and one of the following:
'none' – Return posterior word probabilities.
'inversemean' – Normalize the posterior word probabilities per topic by the geometric mean of the posterior probabilities for this word across all topics. The function uses the formula Phi.*(log(Phi)-mean(log(Phi),1)), where Phi corresponds to ldaMdl.TopicWordProbabilities.
This option supports LDA model input only.
Example: 'Scaling','inversemean'
Data Types: char
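To illustrate the 'inversemean' formula, the following sketch applies it by hand to a small hypothetical probability matrix using only core MATLAB. It assumes Phi is arranged with one row per topic and one column per word, so that mean(...,1) averages over topics as the description states. The values are invented, and this is not necessarily how topkwords arranges the data internally.

```matlab
% Hypothetical 2-topic, 3-word matrix (rows: topics, columns: words).
Phi = [0.5 0.3 0.2;
       0.1 0.3 0.6];

% Formula quoted in the Scaling description.
scaled = Phi .* (log(Phi) - mean(log(Phi),1));

% Equivalent form: each probability times the log of its ratio to the
% geometric mean of that word's probabilities across topics.
geoMean = exp(mean(log(Phi),1));
scaledAlt = Phi .* log(Phi ./ geoMean);
```

A word whose probability is the same in every topic (the second column here) gets a score of zero, while a word that is more probable in one topic than its geometric mean across topics gets a positive score for that topic.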
Output Arguments
tbl — Top words
table | cell array of tables
Top words, returned as a table or a cell array of tables. For bag-of-words and LDA model input, the function sorts the words in order of frequency and importance, respectively.
When the input is a bag-of-words model, the table has the following columns:
Word | Word specified as a string |
Count | Number of times the word appears in the bag-of-words model |
If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top words of the corresponding element of bag.
When the input is an LDA model, the table has the following columns:
Word | Word specified as a string |
Score | Word probability for the given LDA topic |
Tips
To find the most frequently seen n-grams in a bag-of-n-grams model, use topkngrams.
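As a minimal sketch of the n-gram counterpart, the following builds a bag-of-n-grams model from a hypothetical toy document and calls topkngrams. It assumes Text Analytics Toolbox is available and relies on bagOfNgrams using bigrams by default.

```matlab
% Hypothetical toy document, used only for illustration.
documents = tokenizedDocument("tea with milk and tea with sugar");

% Create a bag-of-n-grams model (bigrams by default).
bag = bagOfNgrams(documents);

% Find the most frequent n-grams.
T = topkngrams(bag,3);
```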
Version History
Introduced in R2017b
See Also
bagOfWords | bagOfNgrams | removeInfrequentWords | removeWords | topkngrams | tfidf | ldaModel | tokenizedDocument