fitlsa
Fit LSA model
Description
A latent semantic analysis (LSA) model discovers relationships between documents and the words that they contain. An LSA model is a dimensionality reduction tool useful for running low-dimensional statistical models on high-dimensional word counts. If the model was fit using a bag-of-n-grams model, then the software treats the n-grams as individual words.
mdl = fitlsa(bag,numComponents) fits an LSA model with numComponents components to the bag-of-words or bag-of-n-grams model bag.
mdl = fitlsa(counts,numComponents) fits an LSA model to the documents represented by the matrix of word counts counts.
mdl = fitlsa(___,Name,Value) specifies additional options using one or more name-value pair arguments.
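The dimensionality reduction described above is commonly computed with a truncated singular value decomposition of the word count matrix. The following is a minimal sketch of that general idea using only base MATLAB. It assumes a small made-up count matrix, and it is not a statement of how fitlsa is implemented or how it weights the components; the names docCoords and wordCoords are hypothetical.

```matlab
% Toy 4-by-5 word count matrix (rows = documents, columns = words).
% The values are made up for illustration only.
counts = [2 1 0 0 0
          1 2 0 0 1
          0 0 3 1 0
          0 0 1 2 1];

k = 2;                      % number of latent components to keep

% Truncated SVD: counts is approximated by U*S*V' using the k largest
% singular values.
[U,S,V] = svds(counts,k);

docCoords  = U*S;           % documents in the k-dimensional space (4-by-2)
wordCoords = V*S;           % words in the same k-dimensional space (5-by-2)
```

Each row of docCoords is a low-dimensional representation of one document, which can then be passed to a statistical model in place of the full row of word counts.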
Examples
Fit LSA Model
Fit a Latent Semantic Analysis model to a collection of documents.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" ... ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154
Fit an LSA model with 20 components.
numComponents = 20;
mdl = fitlsa(bag,numComponents)
mdl = 
  lsaModel with properties:

              NumComponents: 20
           ComponentWeights: [2.7866e+03 515.5889 443.6428 316.4191 295.4065 261.8927 226.1649 186.2160 170.6413 156.6033 151.5275 146.2553 141.6741 135.5318 134.1694 128.9931 124.2382 122.2931 116.5035 116.2590]
             DocumentScores: [154x20 double]
                 WordScores: [3092x20 double]
                 Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" ... ] (1x3092 string)
    FeatureStrengthExponent: 2
Transform new documents into the lower-dimensional space using the LSA model.
newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
dscores = transform(mdl,newDocuments)
dscores = 2×20
0.1338 0.1623 0.1680 -0.0541 -0.2464 0.0134 0.2604 0.0205 -0.1127 0.0627 0.3311 -0.2327 0.1689 -0.2695 0.0228 0.1241 0.1198 0.2535 -0.0607 0.0305
0.2547 0.5576 -0.0095 0.5660 -0.0643 0.1236 -0.0082 -0.0522 0.0690 -0.0330 0.0385 0.0803 -0.0373 0.0384 -0.0005 0.1943 0.0207 0.0278 0.0001 -0.0469
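One common use of the transformed scores is to compare a new document with the training documents in the lower-dimensional space. The following is a hedged sketch of that idea, not part of the original example: it builds a tiny made-up corpus so that it runs on its own, and it uses a hand-written cosine similarity. The variable names trainScores, newScores, sim, and closest are hypothetical.

```matlab
% Tiny made-up corpus so that the sketch is self-contained.
docs = tokenizedDocument([
    "the rose smells sweet"
    "music is the food of love"
    "sweet love and sweet music"
    "a rose by any other name"]);
bag = bagOfWords(docs);

% Fit a small LSA model (2 components is fewer than the 4 documents).
mdl = fitlsa(bag,2);

% Project a new document into the same 2-dimensional space.
newDoc = tokenizedDocument("sweet rose");
newScores = transform(mdl,newDoc);      % 1-by-2
trainScores = mdl.DocumentScores;       % 4-by-2

% Cosine similarity between the new document and each training document.
sim = (trainScores*newScores') ./ (vecnorm(trainScores,2,2)*norm(newScores));

% Index of the training document most similar to the new document.
[~,closest] = max(sim);
```

With a corpus this small the similarities are not meaningful; the sketch only shows how the outputs of fitlsa and transform fit together.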
Fit LSA Model to Word Count Matrix
Load the example data. sonnetsCounts.mat contains a matrix of word counts corresponding to preprocessed versions of Shakespeare's sonnets.
load sonnetsCounts.mat
size(counts)
ans = 1×2
154 3092
Fit an LSA model with 20 components. Set the feature strength exponent to 4.
numComponents = 20;
exponent = 4;
mdl = fitlsa(counts,numComponents, ...
    'FeatureStrengthExponent',exponent)
mdl = 
  lsaModel with properties:

              NumComponents: 20
           ComponentWeights: [2.7866e+03 515.5889 443.6428 316.4191 295.4065 261.8927 226.1649 186.2160 170.6413 156.6033 151.5275 146.2553 141.6741 135.5318 134.1694 128.9931 124.2382 122.2931 116.5035 116.2590]
             DocumentScores: [154x20 double]
                 WordScores: [3092x20 double]
                 Vocabulary: ["1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" ... ] (1x3092 string)
    FeatureStrengthExponent: 4
Input Arguments
bag - Input model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.
numComponents - Number of components
positive integer
Number of components, specified as a positive integer. This value must be less than both the number of input documents and the vocabulary size of the input documents.
Example: 200
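The size constraint on numComponents can be checked before fitting. The following is a minimal sketch in base MATLAB, assuming a made-up word count matrix with documents in rows; the names maxComponents and requested are hypothetical and are not part of the fitlsa interface.

```matlab
% Toy 4-by-5 word count matrix (rows = documents, columns = words).
% The values are made up for illustration only.
counts = [2 1 0 0 0
          1 2 0 0 1
          0 0 3 1 0
          0 0 1 2 1];

[numDocuments,numWords] = size(counts);

% numComponents must be less than both the number of documents and the
% vocabulary size, so the largest allowed integer is one below the smaller.
maxComponents = min(numDocuments,numWords) - 1;

% Cap a requested number of components at the largest allowed value.
requested = 20;
numComponents = min(requested,maxComponents);
```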
counts - Frequency counts of words
matrix of nonnegative integers
Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'FeatureStrengthExponent',4 sets the feature strength exponent to 4.
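The two ways of writing a name-value argument described above can be compared side by side. This is a small sketch assuming a made-up word count matrix; the Name=Value form requires R2021a or later, and the variable names mdlQuoted and mdlEquals are hypothetical.

```matlab
% Toy 4-by-5 word count matrix (rows = documents, columns = words).
% The values are made up for illustration only.
counts = [2 1 0 0 0
          1 2 0 0 1
          0 0 3 1 0
          0 0 1 2 1];
numComponents = 2;

% Comma-separated form with the name in quotes (works in all releases).
mdlQuoted = fitlsa(counts,numComponents,'FeatureStrengthExponent',4);

% Name=Value form (R2021a and later).
mdlEquals = fitlsa(counts,numComponents,FeatureStrengthExponent=4);
```

Both calls set the same option, so the FeatureStrengthExponent property of each returned model is 4.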
DocumentsIn - Orientation of documents
'rows' (default) | 'columns'
Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:
'rows' – Input is a matrix of word counts with rows corresponding to documents.
'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.
This option only applies if you specify the input documents as a matrix of word counts.
Note
If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might experience a significant reduction in optimization-execution time.
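As an illustration of the two orientations, the following sketch fits the same made-up data once with documents in rows and once with documents in columns. The count values and the variable names mdlRows and mdlCols are hypothetical. The sketch only compares the sizes of the results, because the signs of individual components may differ between fits.

```matlab
% Toy 4-by-5 word count matrix (rows = documents, columns = words).
% The values are made up for illustration only.
counts = [2 1 0 0 0
          1 2 0 0 1
          0 0 3 1 0
          0 0 1 2 1];
numComponents = 2;

% Default orientation: each row of counts is a document.
mdlRows = fitlsa(counts,numComponents);

% Transposed input: each column of counts' is a document.
mdlCols = fitlsa(counts',numComponents,'DocumentsIn','columns');

% Both models describe 4 documents with 2 components each.
sizeRows = size(mdlRows.DocumentScores);
sizeCols = size(mdlCols.DocumentScores);
```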
FeatureStrengthExponent - Initial feature strength exponent
2 (default) | nonnegative scalar
Initial feature strength exponent, specified as a nonnegative scalar. This value scales the feature component strengths for the documentScores, wordScores, and transform functions.
Example: 'FeatureStrengthExponent',4
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Output Arguments
mdl - Output LSA model
lsaModel object
Output LSA model, returned as an lsaModel object.
Version History
Introduced in R2017b