measuring term frequency of words

I have been able to obtain a bag of words from a document. Please, how can I interact with the bag of words array, so I may make calculations on the frequency of terms within each document?
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
I have this result:
For clarification, I believe the columns are the terms, while the rows are the documents. Am I right?
I loaded 2 txt files (1 document set, 1 query set) I want to evaluate similarity between each document and each query by Cosine similarity, tf-idf or whatsoever means.

3 Comments

friend how to find term frequency only in MATLAB for after bag of words model.Please tell me.for using the formula number of time term t appear in document /total number of terms in document.
If I understand your question correctly, you can simply divide the counts, aka term frequency, by the document length. You may need to adapt the orientation of the vectors a bit, and also transpose everything if you want to, as I did here, display them in a table:
>> str = ["This is a short document.",...
"This is a longer document. With more tokens. Maybe that is about enough?"];
>> td = tokenizedDocument(str)
td =
1×2 tokenizedDocument:
6 tokens: This is a short document .
16 tokens: This is a longer document . With more tokens . Maybe that is about enough ?
>> bow = bagOfWords(td);
>> relFreq = bow.Counts ./ doclength(td).';
>> table(bow.Vocabulary.', relFreq.', 'VariableNames',["Word","relative Frequency"])
ans =
15×2 table
Word relative Frequency
__________ __________________
"This" 0.16667 0.0625
"is" 0.16667 0.125
"a" 0.16667 0.0625
"short" 0.16667 0
"document" 0.16667 0.0625
"." 0.16667 0.125
"longer" 0 0.0625
"With" 0 0.0625
"more" 0 0.0625
"tokens" 0 0.0625
"Maybe" 0 0.0625
"that" 0 0.0625
"about" 0 0.0625
"enough" 0 0.0625
"?" 0 0.0625
Can i ask, is there any way to find the frequency and the number of repeated letters,pair of letters, space in a note, word or pdf file??

Sign in to comment.

 Accepted Answer

See the bagOfWords documentation. E.g., you can use the tfidf function, you can extract bag.Counts and use pdist(bag.Counts,'cosine'), you can use fitlsa for what is essentially a principal component analysis for dimensionality reduction, or fitlda to train/fit a topic model.

2 Comments

I need to compute the similarity between each query loaded in QueTF and each document in DocTF.
How may I do that? QueTF and DocTF are both bag of words.
What is the significance of pdist2?
I am having problems applying this to the bag of words.
Cosss = pdist2(QueTF,DocTF,'cosine');
John, you need to encode both sets of documents with the same bag-of-words model. (That model not only contains counts, it also has a specific mapping which word to put into which position, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:
corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);
query = tokenizedDocument(queryData);
queryVectors = encode(bow,query);
dists = pdist2(queryVectors,bow.Counts,'cosine');

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!