How can I find the number of words in a text file that end with 'ed', 'es' and 'ing'?

Question

Nipun on 3 Oct 2017

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/359422-how-can-i-find-number-of-words-from-a-text-file-which-ends-with-ed-es-and-ing

Commented: Stephen23 on 3 Oct 2017

Accepted Answer: Stephen23

I need to find the words that end with 'ed', 'es' and 'ing' in a text file.

I have tried this, but it's not working:

fid=fopen('sample.txt');
%repeat while end of file is not reached
while ~feof(fid)
  str=(fgetl(fid));
%check for letter 'ed' in end of word
  es=findstr(str,'ed');
    x=0;
    for j=1:length(es)
        if es(j)~=length(str)
          if isletter(str(es(j)+1))
            continue
          end
        end
        x=x+1;
    end
%correct count of vowels with 'ed' in end of word
  sum = sum+x
end
fclose(fid);

2 Comments

KSSV on 3 Oct 2017

Why don't you attach your sample.txt file?

Nipun on 3 Oct 2017

sample.txt

I have attached sample.txt

Answer 1

Stephen23 on 3 Oct 2017

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/359422-how-can-i-find-number-of-words-from-a-text-file-which-ends-with-ed-es-and-ing#answer_284082

Edited: Stephen23 on 3 Oct 2017

It would be much simpler to use regexp:

>> str = fileread('sample.txt');
>> numel(regexp(str,'\w+ed\W')) % ed
ans =  8
>> numel(regexp(str,'\w+es\W')) % es
ans =  2
>> numel(regexp(str,'\w+ing\W')) % ing
ans =  1
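
One caveat with the patterns above: the trailing \W needs a non-word character after the suffix, so a matching word at the very end of the text is not counted. A minimal sketch of a variant, using MATLAB's end-of-word anchor \> instead (the sample string is made up for illustration and is not the contents of sample.txt):

```matlab
% Made-up sample text; the last word ends the string with no trailing character.
str = 'She walked, jumped and is running; he passes boxes. It ended';

% \> anchors the match at the end of a word, so no trailing character is needed.
nEd  = numel(regexp(str, '\w+ed\>'));   % walked, jumped, ended
nEs  = numel(regexp(str, '\w+es\>'));   % passes, boxes
nIng = numel(regexp(str, '\w+ing\>'));  % running

% For comparison: the \W version misses the final word "ended".
nEdOriginal = numel(regexp(str, '\w+ed\W'));
```

Both forms are case sensitive and only treat [a-zA-Z_0-9] as word characters, so accented letters still end a word.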

4 Comments
Walter Roberson on 3 Oct 2017

Use isstrprop() to test for 'upper'. You need to do this instead of counting on 'A':'Z' because RTF includes provision for code pages such as the CP1252 that your document is written in. In particular in CP1252, the following letters are all considered capital:

ABCDEFGHIJKLMNOPQRSTUVWXYZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ
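
As a rough sketch of the isstrprop() test, assuming the text has already been split into a cell array of words (the word list here is made up for illustration):

```matlab
% Hypothetical word list; in practice this would come from the decoded text.
words = {'Apple', 'banana', 'Cherry', 'date'};

% Test the first character of each word for the 'upper' property.
startsUpper = cellfun(@(w) isstrprop(w(1), 'upper'), words);

% Keep only the words that start with a capital letter.
capitalised = words(startsUpper);
```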

Stephen23 on 3 Oct 2017

Try '[A-Z]\w+'

Answer 2

Walter Roberson on 3 Oct 2017

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/359422-how-can-i-find-number-of-words-from-a-text-file-which-ends-with-ed-es-and-ing#answer_284085

That file is in RTF format. There is no direct MATLAB support for reading RTF files, so you either need to parse them yourself or you need to do something like call over to MS Word or LibreOffice to read the file for you.

The sample file includes several characters that the headers indicate are encoded in code page 1252; in the general case, it would be necessary to parse the headers to figure out which code page was being used in order to know how to translate them.

Once you have read in the \'XX sequences and converted them into char(hex2dec('XX')), you need to convert the whole thing into a byte stream (uint8 or double) and then use native2unicode(TheByteStream, 'CP1252') or as appropriate for the code page from the header. The result of this will be a MATLAB char vector that you can then start to process.
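
The decoding step described above could be sketched roughly as follows. This assumes the RTF control words have already been stripped so that only plain ASCII text and \'XX escapes remain; the fragment and variable names are made up for illustration, and 'windows-1252' is used as the encoding name for code page 1252.

```matlab
% Hypothetical RTF fragment: "don't" with U+2019 stored as the CP1252 byte 0x92.
rtfText = 'don\''92t';

% Replace each \'XX escape with the single character whose code is hex XX.
% ${...} is a dynamic expression that regexprep evaluates for every match.
raw = regexprep(rtfText, '\\''([0-9A-Fa-f]{2})', '${char(hex2dec($1))}');

% Treat the result as a byte stream and decode it with the document's code page.
bytes = uint8(raw);
txt = native2unicode(bytes, 'windows-1252');
```

After this, txt is an ordinary MATLAB char vector in which the escape has become U+2019.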

At this point you start needing to define what a "word" is, in order to define whether you have reached the end of a word in order to test whether the word ends in one of the combinations of interest. See https://www.mathworks.com/matlabcentral/answers/355353-how-to-scan-a-sentence-of-text-one-by-one-word-and-store-it-in-seperate-memory-location#comment_487942 for discussion. I mention there U+2019 "right single quote mark", which does occur in your file in the form \'92 . Your code needs to analyze the text to figure out whether those occur in the context of being a quotation with a previous "left single quote mark", U+2018. It happens that U+2018 does not occur in that one sample file: quotations in that sample file happen to use U+201C and U+201D, left and right double quotes, there encoded as \'93 and \'94, so in fact the U+2019 that occur are indicating contractions or possessives, so you need to take that into account.
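
One possible way to take the contraction and possessive case into account is to let an apostrophe between letters be part of the word. This is only a sketch under a deliberately narrow definition of "word" (ASCII letters only), with a made-up sample sentence, and it assumes R2016b or later for endsWith:

```matlab
apos = char(8217);                          % U+2019, right single quote mark
str = ['The bed' apos 's frame cracked'];   % made-up sample text

% A word is a run of letters, optionally continued by an apostrophe
% (ASCII or U+2019) followed by more letters.
wordPattern = ['[A-Za-z]+(?:[''' apos '][A-Za-z]+)*'];
words = regexp(str, wordPattern, 'match');

% The possessive stays one word, so it does not count as ending in 'ed';
% only "cracked" does.
nEd = nnz(endsWith(words, 'ed'));
```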

The sample text also uses U+2013, en-dash, here encoded as \'96 . En-dashes are used to indicate numeric ranges (in which case the two parts should be considered separately) and to indicate direction, such as "New York-San Francisco flight" (in which case the two parts should be considered separately), but the primary use is to indicate compound words (in which case the two parts should not be considered separately). For example, "the disjointed–listing–process" does not contain any words ending in "ed" or "ing": it is one single word that ends in "ess". I do not think there is any syntactic way to distinguish en-dash for direction (which should be split) from en-dash for compound words (which should not be split). But read on.

The sample text does not use U+2014, em-dash, which would be \'97 if it occurred. However, the way it uses U+2013, en-dash, is in error: the en-dash it uses should obviously have been an em-dash instead. We can get a clue by noticing that the \'96 has a space on both sides of it. Used properly, neither em-dash nor en-dash should have a space on either side. Therefore, in order to do proper analysis, you will need to examine any en-dash in the text, and if there is a space on either side, treat it as an em-dash (which always breaks the word).
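
A minimal sketch of that en-dash check, assuming the text has already been decoded to a MATLAB char vector (the sample string is made up for illustration):

```matlab
enDash = char(8211);   % U+2013
emDash = char(8212);   % U+2014

% Made-up text: one spaced en-dash (misused) and one unspaced numeric range.
txt = ['finished ' enDash ' started, pages 3' enDash '5'];

% Match an en-dash that has whitespace directly before or directly after it.
spacedEnDash = ['(?<=\s)' enDash '|' enDash '(?=\s)'];

% Rewrite those as em-dashes; unspaced en-dashes are left alone.
fixed = regexprep(txt, spacedEnDash, emDash);
```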

Code Page 1252 is not able to encode either U+2010, hyphen, or U+00AD, soft-hyphen, neither of which is a word break, so if you are certain you are using CP1252 you can do less processing for those. On the other hand, that means you have to worry about whether U+002D "-", the hyphen-minus, represents a negative sign, a hyphen, or something that should have been encoded as an en-dash or em-dash -- since we already know that the text is not using the hyphens and dashes properly.

Answer 3

KSSV on 3 Oct 2017

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/359422-how-can-i-find-number-of-words-from-a-text-file-which-ends-with-ed-es-and-ing#answer_284080

fid=fopen('sample.txt');
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
check = {'ed', 'es', 'ing'} ;
total_ed = 0 ;
total_es = 0 ;
total_ing = 0 ;
for i = 1:length(S)
    str = strsplit(S{i}) ;
    % count ed
    idx1 = find(contains(str,'ed')) ;
    total_ed = total_ed+length(idx1) ;
    % count es
    idx2 = find(contains(str,'es')) ;
    total_es = total_es+length(idx2) ;
    % count ing
    idx3 = find(contains(str,'ing')) ;
    total_ing = total_ing+length(idx3) ;
end

1 Comment

Nipun on 3 Oct 2017

Thank you for your help. But I need to find only the words that end with 'ed', 'es' and 'ing'. This one counts all the words that contain 'es', 'ed' or 'ing'.
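
A minimal sketch of the change this comment asks for, swapping contains for endsWith (both need R2016b or later). The sample line is made up for illustration, and trailing punctuation is stripped first because strsplit only splits on whitespace:

```matlab
% Made-up line of text standing in for one entry of S.
textLine = 'He walked, she jumped; the bedding dries and the boxes fell.';

% Split on whitespace, then strip trailing punctuation from each token.
tokens = strsplit(textLine);
tokens = regexprep(tokens, '\W+$', '');

% Count only the tokens that END with each suffix.
total_ed  = nnz(endsWith(tokens, 'ed'));   % walked, jumped
total_es  = nnz(endsWith(tokens, 'es'));   % dries, boxes
total_ing = nnz(endsWith(tokens, 'ing'));  % bedding
```

With contains, "bedding" would also have been counted under 'ed'.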
