Why is regexp including extra data?

Question

Brad on 29 Apr 2013


https://www.mathworks.com/matlabcentral/answers/74027-why-is-regexp-including-extra-data

I’m trying to use REGEXP to match the following flags (State, Post, RC, State, Junk), and then create a cell array of strings.

My inputs are:

25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B
25.287466  156.162447  21578.288  97.234234  Post  BBBBB  2  C11B
25.387466  156.362447  21578.388  97.334234  RC  CCCCC  3  C22B
25.387466  156.362447  21578.388  97.334234  State  DDDDD  4  C33B
25.387466  156.362447  21578.388  97.334234  Junk  EEEEE  5  C44B

I’m using the following MATLAB commands:

    data = regexp(LineTxt,'-?\d+(\.\d+)?','split');
    Flag = cellstr(data{1,6});

For unknown reasons I keep getting the following output:
'  State  AAAAA'  '  Post  BBBBB'  '  RC  CCCCC'  '  State  DDDDD'  '  Junk  EEEEE'

Intended output is:

' State ' ' Post ' ' RC ' ' State ' ' Junk'

Why are the extra fields being included?

1 Comment

Cedric on 29 Apr 2013

Edited: Cedric on 29 Apr 2013

So here, are 'State', 'Post', etc. placeholders for the values you actually want to extract (e.g. Alabama, Michigan), or do you really need to get the literal words 'State', 'Post', etc.?

EDIT: reading your comment below, you have other lines with a different structure in between these lines. What do they look like? In other words, what can we use to make the match specific to the relevant lines?


Answer 1

per isakson on 29 Apr 2013


https://www.mathworks.com/matlabcentral/answers/74027-why-is-regexp-including-extra-data#answer_83880

Edited: per isakson on 30 Apr 2013

Neither do I understand why. Regular expressions are tricky. Another approach:

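    % rows: cell array holding one line of text per cell (assumed name)
    data = cell( size( rows ) );
    for k = 1 : numel( rows )
        data{k} = regexp( rows{k}, '((State)|(Post)|(RC)|(State)|(Junk))', 'match','once' );
    end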

[Edit] IMO, given that the file is written with a similar format string, textscan is the best way to read the file.

    fid = fopen( 'cssm.txt' );
    cac = textscan( fid, '%*u%*f%*f%*f%*f%s%*s%*d%*s' );
    fclose( fid );
    cac{:}

returns

    ans = 
        'State'
        'Post'
        'RC'
        'State'
        'Junk'
.
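The same format string can be tried without a file, since textscan also accepts a character vector. The following is only a minimal sketch: the variable names txt and flags are made up for illustration, and the leading row number in each sample row is an assumption based on the format string above.

```matlab
% Two rows from the question, with the leading row number assumed
txt = sprintf('%s\n%s\n', ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B');

% Fields written as %*... are read and discarded; only the single %s
% field (the flag) is returned
cac = textscan(txt, '%*u%*f%*f%*f%*f%s%*s%*d%*s');
flags = cac{1};   % column cell array with one flag per row
disp(flags)
```

Because every skipped field is typed, a row that does not follow this layout will not parse the same way, which ties in with the robustness concern raised further down.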

Answer to comment:

    str = fileread( 'cssm.txt' );
    data = regexp( str, '((State)|(Post)|(RC)|(State)|(Junk))', 'match' )

returns

    data = 
        'State'    'Post'    'RC'    'State'    'Junk'

where cssm.txt contains your five lines of data

However, the smarter the solution, the harder it is to make the code robust (and flexible). How should the code behave if there are rows in the file that do not adhere to the "format" that I infer from your example? And in a few days, you might find that you need the third column.

4 Comments

Cedric on 30 Apr 2013

Edited: Cedric on 30 Apr 2013

I'm voting for Per's edit if all you need is the sequence of these words and the words are known a priori.

Brad on 30 Apr 2013

Per, Cedric: First off, thank you for taking a look at this problem. Even though the REGEXP function appears to be quite valuable, I have little experience with it. After some additional reading last night, I varied my approach to solving this problem by implementing/testing the following:

    data = regexp(LineTxt,'  +','split');
    TrajFlag = cellstr(data{1,6});

Result:

'97.134234' '97.234234' '97.334234' 'State' 'Junk'

In executing this, I believe I found the source of the problem. Each row of data begins with a number (in this case numbers 1, 2, 9, 99, and 999). For values greater than or equal to 10, this problem doesn’t exist. I spoke with one of the developers this morning and verified that the values are between 1 and 999, but are always written with the fewest digits possible. So instead of 001, their code outputs 1. Instead of 010, it outputs 10. For all 3-digit values (100-999), the output is always in 3 digits.

So I get '97.134234' '97.234234' '97.334234' instead of ‘State’ ‘Post’ ‘RC’ when using the above commands.

If I use data = regexp(LineTxt,' +','split') (one space removed), I get the following result:

'97.134234' '97.234234' '97.334234' '97.334234' 'Junk'

If I use data = regexp(LineTxt,'   +','split') (3 spaces prior to the +), I get the following error:

    Index exceeds matrix dimensions.
    Error in RSA (line 102)
    TrajFlag=cellstr(data{1,6});
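One way to make the field index independent of the leading blanks is to trim each line before splitting on runs of whitespace. This is only a sketch under assumptions: the rows below are hypothetical, with the row number right-aligned in a 3-character field, and the names rows, fields, and flags are made up for illustration.

```matlab
% Hypothetical rows: the row number is assumed to be right-aligned in a
% 3-character field, so short numbers leave leading blanks
rows = { '  1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B', ...
         ' 99  25.387466  156.362447  21578.388  97.334234  State  DDDDD  4  C33B', ...
         '999  25.387466  156.362447  21578.388  97.334234  Junk  EEEEE  5  C44B' };

flags = cell(size(rows));
for k = 1:numel(rows)
    % strtrim removes the leading blanks, so the split never starts
    % with an empty first field
    fields = regexp(strtrim(rows{k}), '\s+', 'split');
    flags{k} = fields{6};   % sixth field is the flag once the row is trimmed
end
disp(flags)
```

Splitting on '\s+' rather than on a fixed number of spaces also means the result does not depend on how many blanks separate the columns.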

I’m just reading your comments now:

1. The file I read in has a repeating structure that looks like this:

01Jan2013 Live 2 1 01 0000000001 Low

1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B

21600.123 21612.122

3100.435 4100.380 5100.739

-0.491736 1.491492 0.891808

0.051748 -0.071254 0.021175

9 x 9 matrix of floats

01Jan2013 Live 2 1 01 0000000002 Low

2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B

21602.223 21612.222

3200.435 4200.380 5200.739

-0.492736 1.492492 0.892180

0.052748 -0.072254 0.022175

9 x 9 matrix of floats

01Jan2013 Live 2 1 01 0000000003 Low

9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B

21604.333 21612.333

3300.435 4300.380 5300.739

-0.493736 1.493492 0.893180

0.053748 -0.073254 0.023175

.

Knowing I receive a number of messages in a repeating pattern, I’ve implemented the following, where the line in question for each message is line number 2.

for scr = 1:Num_MESs
    while ~feof(fid)
        LineNum = LineNum + 1;
        .
        .
        .
        .
        .
        % Read one line (row) of text from the input file at a time
        LineTxt = fgetl(fid);
        if LineNum == 2
            data = regexp(LineTxt,'   +','split');
            Flag = cellstr(data{1,6});
        end
        .
        .
        .
        .
        .
    end
    fclose(fid);
end

2. I agree that my original approach was too generic to get the intended result – the words ‘State’ ‘Post’ ‘RC’ ‘State’ ‘Junk’. I’m working with the match string output in hopes of achieving the desired result.

Thanks for your inputs.
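An alternative to counting line numbers is to anchor the pattern to the structure of the flag line and capture the flag as a token, so that lines with a different layout simply do not match. The sketch below is an assumption about that layout (an integer, four decimal numbers, then the flag), and the names fileLines, pattern, tok, and flags are hypothetical. In the real script each line would come from fgetl(fid).

```matlab
% Lines taken from the sample structure above
fileLines = { '01Jan2013 Live 2 1 01 0000000001 Low', ...
              '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
              '21600.123 21612.122', ...
              '3100.435 4100.380 5100.739' };

% Row number, four decimal numbers, then the flag captured as a token
pattern = '^\s*\d+(?:\s+-?\d+\.\d+){4}\s+([A-Za-z]+)\s';

flags = {};
for k = 1:numel(fileLines)
    tok = regexp(fileLines{k}, pattern, 'tokens', 'once');
    if ~isempty(tok)               % empty when the line has another layout
        flags{end+1} = tok{1};     %#ok<AGROW>
    end
end
disp(flags)
```

With this approach the header line and the purely numeric lines are skipped because they do not start with an integer followed by four decimals.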


Answer 2

Cedric on 30 Apr 2013


https://www.mathworks.com/matlabcentral/answers/74027-why-is-regexp-including-extra-data#answer_83891

Edited: Cedric on 30 Apr 2013

EDIT: if all words that you want to extract are known a priori, Per gave you the answer.

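Your pattern is too generic: it splits the content each time a number is found, and with the 'split' option the first output arg. of REGEXP is the non-matching remainder, i.e. the text between consecutive matches. Since no number sits between the flag and the field that follows it, both end up in the same cell.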
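To see this on a single row, the split pieces and the matched numbers can be requested together. This is a minimal sketch: the leading row number and the double-space separators in LineTxt are assumptions about the real file, and the names pieces and numbers are made up for illustration.

```matlab
% Sample row from the question, with an assumed leading row number
LineTxt = '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B';

% 'split' returns the text BETWEEN matches, 'match' the matches themselves
[pieces, numbers] = regexp(LineTxt, '-?\d+(\.\d+)?', 'split', 'match');

disp(numbers)      % numeric substrings, including the '00' inside C00B
disp(pieces{6})    % everything between 97.134234 and the next number
```

The sixth piece contains both State and AAAAA because the pattern only breaks the line where a number occurs.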

See my comment below your question, and if you provide me with more information, we can build a solution based on a clean match (if possible).

To give you an example, if the lines between the relevant lines that you mention contained only numeric values, you could match the keywords that you are interested in as follows:

 >> buffer = fileread('myFile.txt') ;
 >> match = regexp(buffer, '(?<=\d\s+)[a-zA-Z]*(?=\s)', 'match')
 match = 
    'State'    'Post'    'RC'    'State'    'Junk'

Note that we could use REGEXPI to make the pattern a little simpler. Here, we defined patterns to be matched as:

As many letters as possible (only chars. a to z in lower and upper cases are allowed): [a-zA-Z]*
Preceded by (positive lookbehind (?<=)): a numeric char. \d followed by one or more white spaces \s+
Followed by (positive lookahead (?=)): one white space \s
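The REGEXPI variant mentioned above could look like the following sketch. The buffer is built from two sample rows instead of being read with fileread, the leading row numbers are assumed, and + is used instead of * so that at least one letter is required.

```matlab
% Two sample rows joined with newlines stand in for the file content
buffer = sprintf('%s\n%s\n', ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B');

% REGEXPI ignores case, so [a-z] also covers upper-case letters
match = regexpi(buffer, '(?<=\d\s+)[a-z]+(?=\s)', 'match');
disp(match)
```

Words such as AAAAA are not returned because they are preceded by a letter and a space rather than by a digit and a space.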

.. and for those who don't know regular expressions, the regexp engine looks for this pattern in buffer (which contains the file's content) until it finds a match, saves this match in a cell array, restarts after the match (it eats the string in some sense) and goes on iteratively matching/saving until the end of buffer. As the 3rd arg. asks for 'match', REGEXP outputs the cell array of matches.

1 Comment

Brad on 30 Apr 2013

Per, Cedric: Thanks for the push in the right direction. Code appears to be running like a champ!!
