How can I find the number of words in a text file that end with 'ed', 'es' and 'ing'?

I need to find the words that end with 'ed', 'es' and 'ing' in a text file.
I have tried this, but it's not working:
fid=fopen('sample.txt');
%repeat while end of file is not reached
while ~feof(fid)
    str=(fgetl(fid));
    %check for letter 'ed' in end of word
    es=findstr(str,'ed');
    x=0;
    for j=1:length(es)
        if es(j)~=length(str)
            if isletter(str(es(j)+1))
                continue
            end
        end
        x=x+1;
    end
    %correct count of vowels with 'ed' in end of word
    sum = sum+x
end
fclose(fid);

Accepted Answer

Stephen23 on 3 Oct 2017
Edited: Stephen23 on 3 Oct 2017
It would be much simpler to use regexp:
>> str = fileread('sample.txt');
>> numel(regexp(str,'\w+ed\W')) % ed
ans = 8
>> numel(regexp(str,'\w+es\W')) % es
ans = 2
>> numel(regexp(str,'\w+ing\W')) % ing
ans = 1
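One caveat with the patterns above: `\W` has to consume a character, so a matching word at the very end of the text (with no trailing newline or punctuation) is not counted. A minimal sketch of an alternative using MATLAB's end-of-word anchor `\>` follows. The sample string is made up for illustration and is not the asker's file.

```matlab
% Made-up sample text (an assumption, not the contents of sample.txt)
str = 'She jumped, he passes and they were singing';

% \W must match a character after the word, so the final word is missed
nIngTrailing = numel(regexp(str, '\w+ing\W'));

% \> anchors at the end of a word without consuming anything
nEd  = numel(regexp(str, '\w+ed\>'));
nEs  = numel(regexp(str, '\w+es\>'));
nIng = numel(regexp(str, '\w+ing\>'));
```

Note that `\w` also matches digits and underscores, and that matching is case-sensitive unless `regexpi` or the 'ignorecase' option is used.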
  4 Comments
Walter Roberson on 3 Oct 2017
Use isstrprop() to test for 'upper'. You need to do this instead of counting on 'A':'Z' because RTF includes provision for code pages such as the CP1252 that your document is written in. In particular in CP1252, the following letters are all considered capital:
ABCDEFGHIJKLMNOPQRSTUVWXYZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ
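As a rough illustration of the difference, the sketch below compares isstrprop() with a plain 'A':'Z' range test. The test string is made up, and the expectation that the accented capital is flagged by isstrprop() is an assumption about the MATLAB release in use.

```matlab
% Made-up example string; char(192) is U+00C0, capital A with grave accent
str = ['aB' char(192) 'c'];

% Logical mask of the characters MATLAB classifies as uppercase
isUp = isstrprop(str, 'upper');

% A plain range test only recognises the 26 ASCII capitals
isAsciiUp = str >= 'A' & str <= 'Z';
```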


More Answers (2)

Walter Roberson on 3 Oct 2017
That file is in RTF format. There is no direct MATLAB support for reading RTF files, so you either need to parse them yourself or you need to do something like call over to MS Word or LibreOffice to read the file for you.
The sample file includes several characters that the headers indicate are encoded in code page 1252; in the general case, it would be necessary to parse the headers to figure out which code page was being used in order to know how to translate them.
Once you have read in the \'XX sequences and converted them into char(hex2dec('XX')), you need to convert the whole thing into a byte stream (uint8 or double) and then use native2unicode(TheByteStream, 'CP1252') or as appropriate for the code page from the header. The result of this will be a MATLAB char vector that you can then start to process.
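The decoding step described above could be sketched as follows. This is only a sketch under stated assumptions: `rtfText` is a hypothetical fragment of body text that has already been separated from the RTF control words, the header is not parsed, and the encoding name 'windows-1252' is used for code page 1252.

```matlab
% Hypothetical fragment of RTF body text; \'92 is the CP1252 byte 0x92
rtfText = 'It\''92s done';

% Split on the \'XX escapes, capturing the two hex digits of each one
[hexCodes, plainParts] = regexp(rtfText, '\\''([0-9a-fA-F]{2})', 'tokens', 'split');

% Rebuild a byte stream: plain text bytes interleaved with the decoded bytes
bytes = uint8(plainParts{1});
for k = 1:numel(hexCodes)
    bytes = [bytes, uint8(hex2dec(hexCodes{k}{1})), uint8(plainParts{k+1})]; %#ok<AGROW>
end

% Interpret the bytes as code page 1252 to get a Unicode char vector
txt = native2unicode(bytes, 'windows-1252');
```

Here the byte 0x92 ends up as U+2019 in `txt`, which is the character discussed next.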
At this point you start needing to define what a "word" is, in order to define whether you have reached the end of a word in order to test whether the word ends in one of the combinations of interest. See https://www.mathworks.com/matlabcentral/answers/355353-how-to-scan-a-sentence-of-text-one-by-one-word-and-store-it-in-seperate-memory-location#comment_487942 for discussion. I mention there U+2019 "right single quote mark", which does occur in your file in the form \'92 . Your code needs to analyze the text to figure out whether those occur in the context of being a quotation with a previous "left single quote mark", U+2018. It happens that U+2018 does not occur in that one sample file: quotations in that sample file happen to use U+201C and U+201D, left and right double quotes, there encoded as \'93 and \'94, so in fact the U+2019 that occur are indicating contractions or possessives, so you need to take that into account.
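One possible word definition that keeps contractions and possessives together is sketched below. It is an assumption for illustration, not the only reasonable definition: a word is taken to be a run of ASCII letters, optionally joined by U+2019 when it sits between letters. Accented letters and quotation contexts are not handled, and the sample sentence is made up.

```matlab
apos = char(8217);  % U+2019, right single quotation mark

% Made-up sample sentence containing two contractions
txt = ['She' apos 's tired and couldn' apos 't have walked'];

% A word is a run of ASCII letters, optionally joined by U+2019 between letters
pat = ['[A-Za-z]+(?:' apos '[A-Za-z]+)*'];
words = regexp(txt, pat, 'match');

% Count the words that end in 'ed' (endsWith needs R2016b or later)
nEd = sum(endsWith(words, 'ed'));
```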
The sample text also uses U+2013, en-dash, here encoded as \'96. En-dashes are used to indicate numeric ranges (in which case the two parts should be considered separately) and to indicate direction, such as "New York-San Francisco flight" (in which case the two parts should be considered separately), but the primary use is to indicate compound words (in which case the two parts should not be considered separately). For example, "the disjointed–listing–process" does not contain any words ending in "ed" or "ing": it is one single word that ends in "ess". I do not think there is any syntactic way to distinguish en-dash for direction (which should be split) from en-dash for compound words (which should not be split). But read on.
The sample text does not use U+2014, em-dash, which would be \'97 if it occurred. However, the way it uses U+2013, en-dash, is in error: the en-dash it uses should obviously have been an em-dash instead. We can get a clue by noticing that the \'96 has a space on both sides of it. Used properly, neither em-dash nor en-dash should have a space on either side. Therefore, in order to do proper analysis, you will need to examine any en-dash in the text, and if there is a space on either side, treat it as an em-dash (which always breaks the word).
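A minimal sketch of that normalisation step, assuming the text has already been decoded into a MATLAB char vector. The sample string is made up, and the characters are built from their code points so the code does not depend on the file's own encoding.

```matlab
enDash = char(8211);  % U+2013
emDash = char(8212);  % U+2014

% Made-up sample: one spaced en-dash (misused) and one numeric range
txt = ['done ' enDash ' finished and pages 3' enDash '5'];

% Match an en-dash that has whitespace immediately before or after it
pat = ['(?<=\s)' enDash '|' enDash '(?=\s)'];

% Replace only those en-dashes with an em-dash; the range 3-5 is left alone
fixed = regexprep(txt, pat, emDash);
```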
Code Page 1252 is not able to encode U+2010, hyphen, so if you are certain you are using CP1252 you do not need to process that one. It can encode U+00AD, soft-hyphen (as \'ad), which is not a word break. On the other hand, that means you have to worry about whether U+002D "-", the hyphen-minus, represents a negative sign, a hyphen, or something that should have been encoded as an en-dash or em-dash, since we already know that the text is not using hyphens and dashes properly.

KSSV on 3 Oct 2017
fid=fopen('sample.txt');
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
check = {'ed', 'es', 'ing'} ;
total_ed = 0 ;
total_es = 0 ;
total_ing = 0 ;
for i = 1:length(S)
    str = strsplit(S{i}) ;
    % count ed
    idx1 = find(contains(str,'ed')) ;
    total_ed = total_ed+length(idx1) ;
    % count es
    idx2 = find(contains(str,'es')) ;
    total_es = total_es+length(idx2) ;
    % count ing
    idx3 = find(contains(str,'ing')) ;
    total_ing = total_ing+length(idx3) ;
end
  1 Comment
Nipun on 3 Oct 2017
Thank you for your help. But I need to find only the words that end with 'ed', 'es' and 'ing'. This code counts every word that contains 'ed', 'es' or 'ing' anywhere in it.
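A rough sketch of one way to count only word endings is to extract the words without punctuation and test them with endsWith() rather than contains(). The sample line is made up, words are assumed to be runs of ASCII letters, and endsWith() needs R2016b or later.

```matlab
% Made-up sample line; in practice each line would come from the file
txtLine = 'He walked home, passing the boxes and singing.';

% Extract runs of letters so trailing punctuation does not hide a suffix
words = regexp(txtLine, '[A-Za-z]+', 'match');

% Count the words whose ending matches each suffix
total_ed  = sum(endsWith(words, 'ed'));
total_es  = sum(endsWith(words, 'es'));
total_ing = sum(endsWith(words, 'ing'));
```

Applied per line inside a loop like the one in the answer above, the three totals would be accumulated instead of assigned.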

