Thread Subject: reading formatted strings

Subject: reading formatted strings

From: CyberFrog

Date: 21 Mar, 2010 23:18:04

Message: 1 of 7

Hi,

I would like to read certain lines from a file, treating each line as a string. First I would like to find a particular matching field and, once it is found, read in the 3rd field counted from where the match was found. For example, suppose the line I have read in is 'Blackbird singing in the dead of night'. I want to find and match the field 'in' and, if it is found, read in the 3rd field from that point, in this case 'dead'. What is the best way to do this in code? I have been using the textscan function, but the problem is that you have to specify the format, i.e. %s %s would read Blackbird singing. The lines I read will vary in the number of fields, so I don't think I can use this.

Many thanks

CF

Subject: reading formatted strings

From: TideMan

Date: 21 Mar, 2010 23:46:42

Message: 2 of 7

So, you need to read in one line at a time as a string?
help fgetl
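
A minimal sketch of the fgetl approach, assuming a small demo file created on the spot (the file contents and the variable names fname, lines and TheLine are only for illustration):

```matlab
% Create a hypothetical demo file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Blackbird singing in the dead of night\n');
fprintf(fid, 'A shorter line\n');
fclose(fid);

% Read one line at a time. fgetl returns the line without its newline,
% and returns the number -1 (not a string) at end of file.
fid = fopen(fname, 'r');
lines = {};
TheLine = fgetl(fid);
while ischar(TheLine)
    lines{end+1} = TheLine; %#ok<AGROW>
    TheLine = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Each element of lines is then an ordinary string, whatever its number of fields, so it can be searched without a fixed format.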

Subject: reading formatted strings

From: Walter Roberson

Date: 22 Mar, 2010 02:07:27

Message: 3 of 7

If you want to use textscan, then set Delimiter to be '\n' and it will
then read a line at a time. Or use the earlier suggestion of switching
over to fgetl() instead of textscan.
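
As a rough illustration of the textscan option, assuming a small demo file written on the spot (the file contents and the names fname, C and allLines are hypothetical):

```matlab
% Create a hypothetical demo file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Blackbird singing in the dead of night\n');
fprintf(fid, 'A shorter line\n');
fclose(fid);

% With a single %s and the newline as the only delimiter, each whole
% line becomes one string, however many fields it contains.
fid = fopen(fname, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
delete(fname);

allLines = C{1};   % cell array of strings, one per line
```

This reads the whole file in one call, whereas fgetl lets you stop as soon as the line of interest is found.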

Once you have the line read in, I suggest you consider using regexp:

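WordToMatch = 'in';
% Note: this pattern is as first posted and contains typos; the corrected
% expression is given in message 6 of this thread.
thirdfield = regexp(TheLine, ['\<' WordToMatch ...
    '/>(?\s+\w+){2}(\w+)(?$|\s)'], 'tokens');

if ~isempty(thirdfield)
   disp(['Third field was: ', thirdfield{1}{1}]);
else
   disp('Did not match the pattern!')
end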

Subject: reading formatted strings

From: CyberFrog

Date: 22 Mar, 2010 08:05:08

Message: 4 of 7

Hey thanks Walter, I'll certainly get this going with these commands instead. I think I was getting too bogged down with using textscan and findstr; a second opinion helps sometimes.

Subject: reading formatted strings

From: CyberFrog

Date: 22 Mar, 2010 12:07:03

Message: 5 of 7

Hi,

I have now tried this and it is exactly what I need, except: how do I then extract only the 3rd field, i.e. 'dead'? Using thirdfield gives all three fields.

thanks

Subject: reading formatted strings

From: Walter Roberson

Date: 22 Mar, 2010 22:07:34

Message: 6 of 7

Sorry, I had some typos in the expression. That tends to happen when you write
free-hand perl-style regular expressions when you are tired :(

The corrected line is:

thirdfield = regexp(TheLine, ['\<' WordToMatch ...
    '\>(?:\s+\w+){2}\s+(\w+)(?:$|\W)'], 'tokens');

Note that "the third field from where the original match was found" would be
'of'. 'the' would be the first field from the match, 'dead' would be the
second field, 'of' would be the third field. If you want to match the 'dead'
(that is, skip one word after the end of the target word), then change the {2}
to {1}, or remove the {2} entirely, or you could rewrite the second part as
'\>\s+\w+\s+(\w+)(?:$|\s)'
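
To make the skip count explicit, here is a hedged sketch that builds the pattern from a variable. The names nSkip, pattern, tok and field are illustrative only, and the call to regexptranslate is an extra safeguard of this sketch (it escapes any regular-expression metacharacters in the word being searched for), not something the expression above requires:

```matlab
TheLine = 'Blackbird singing in the dead of night';
WordToMatch = 'in';
nSkip = 1;   % hypothetical name: how many words to skip after the match

% Assemble: whole-word match, nSkip skipped words, then the captured word.
pattern = ['\<' regexptranslate('escape', WordToMatch) ...
           '\>(?:\s+\w+){' num2str(nSkip) '}\s+(\w+)(?:$|\W)'];

% With 'once', the tokens of the first match come back as a cell array
% of strings, or as an empty cell if nothing matched.
tok = regexp(TheLine, pattern, 'tokens', 'once');

if ~isempty(tok)
    field = tok{1};   % 'dead' for nSkip = 1
else
    field = '';
end
```

Setting nSkip = 2 reproduces the {2} version and extracts 'of' from the same line.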

By the way, the purpose of the (?:$|\W) is to make sure the extracted word ends at
a non-word character or the end of the line. Bear in mind that \w matches
letters, digits and the underscore, so jim23 would be captured whole, while for
it's only the 'it' would be captured. If you want to match up to the next
whitespace instead, then replace the (\w+) with (\S+) and then you can drop the
(?:$|\W). Note though
that if you make this change then if there happens to be punctuation
immediately after the word to be extracted, the punctuation will be brought in
as well. Automatically determining what is punctuation and what is part of the
word can be a bit tricky in English, as an apostrophe directly after a word
might be closing a quotation or might be indicating the possessive of a
plural. English cannot be analyzed properly using a Context Free Grammar (CFG).
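
The difference between the two captures can be seen on a line with punctuation. This is only a sketch; the sample line with the added comma and the names tokW and tokS are made up for the illustration:

```matlab
% Hypothetical sample line: a comma directly follows the word to extract.
TheLine = 'Blackbird singing in the dead, of night';

% (\w+) stops at the first non-word character, so the comma is left out.
tokW = regexp(TheLine, '\<in\>\s+\w+\s+(\w+)(?:$|\W)', 'tokens', 'once');

% (\S+) runs up to the next whitespace, so the comma is brought in too.
tokS = regexp(TheLine, '\<in\>\s+\w+\s+(\S+)', 'tokens', 'once');

wordOnly = tokW{1};   % 'dead'
withPunct = tokS{1};  % 'dead,'
```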

Subject: reading formatted strings

From: CyberFrog

Date: 22 Mar, 2010 22:45:21

Message: 7 of 7

Many thanks Walter, very informative.
