Why is regexp including extra data?

Asked by Brad on 29 Apr 2013

I’m trying to use REGEXP to match the following flags (State, Post, RC, State, Junk), and then create a cell array of strings.

My inputs are:

1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B
2  25.287466  156.162447  21578.288  97.234234  Post  BBBBB  2  C11B
9  25.387466  156.362447  21578.388  97.334234  RC  CCCCC  3  C22B
99  25.387466  156.362447  21578.388  97.334234  State  DDDDD  4  C33B
999  25.387466  156.362447  21578.388  97.334234  Junk  EEEEE  5  C44B

I’m using the following MATLAB commands:

data = regexp(LineTxt,'-?\d+(\.\d+)?','split');
Flag=cellstr(data{1,6});
For unknown reasons I keep getting the following output:
'  State  AAAAA'	'  Post  BBBBB'	'  RC  CCCCC'	'  State  DDDDD'	'  Junk  EEEEE'

Intended output is:

' State '	' Post '	' RC '	' State '	'  Junk'

Why are the extra fields being included?

1 Comment

Cedric Wannaz on 29 Apr 2013

So here, do what you call 'State', 'Post', etc. indicate places where the values you want to get lie (e.g. Alabama, Michigan, etc.), or do you really have to get the words 'State', 'Post', etc. themselves?

EDIT: reading your comment below, you have other lines with a different structure in between these lines. What do they look like? By that I mean: what can we use to make the match specific to the relevant lines?

2 Answers

Answer by per isakson on 29 Apr 2013
Edited by per isakson on 30 Apr 2013
Accepted answer

Neither do I understand why. Regular expressions are tricky. Another approach:

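    % rows is assumed to be a cell array holding one line of text per cell
    data = cell( size( rows ) );
    for k = 1 : numel( rows )
        data{k} = regexp( rows{k}, '((State)|(Post)|(RC)|(State)|(Junk))', 'match', 'once' );
    end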

[Edit] IMO, given that the file is written with a similar format string, textscan is the best way to read it.

    %%
    fid = fopen( 'cssm.txt' );
    cac = textscan( fid, '%*u%*f%*f%*f%*f%s%*s%*d%*s' );
    fclose( fid );
    cac{:}

returns

    ans = 
        'State'
        'Post'
        'RC'
        'State'
        'Junk'

Answer to comment:

    str = fileread( 'cssm.txt' );
    data = regexp( str, '((State)|(Post)|(RC)|(State)|(Junk))', 'match' )

returns

    data = 
        'State'    'Post'    'RC'    'State'    'Junk'

where cssm.txt contains your five lines of data

However, the smarter the solution, the harder it is to make the code robust (and flexible). How should the code behave if there are rows in the file that do not adhere to the "format" that I infer from your example? And in a few days, you might find that you need the third column.
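
On the robustness question, one hedged way to tolerate rows that carry no flag is to test each line and skip empty results. The sketch below assumes the lines are already held in a cell array named rows (a hypothetical name, not from the original code) and that the four flag words are the only ones of interest. With the 'once' option, regexp returns an empty result when nothing matches, so isempty can serve as the filter.

```matlab
% Hypothetical input: one conforming row and one purely numeric row
rows = { '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B'
         '21600.123 21612.122' };

flags = {};
for k = 1:numel(rows)
    % \< and \> restrict the match to whole words
    hit = regexp(rows{k}, '\<(State|Post|RC|Junk)\>', 'match', 'once');
    if ~isempty(hit)
        flags{end+1} = hit; %#ok<AGROW>  % keep only rows that carry a flag
    end
end
```

Rows that do not contain one of the words are silently skipped here; whether they should raise a warning instead is a design choice that depends on how strict the file format is.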

4 Comments

per isakson on 30 Apr 2013

Yes, see above.

Cedric Wannaz on 30 Apr 2013

I'm voting for Per's edit if all you need is the sequence of these words and they are known a priori.

Brad on 30 Apr 2013

Per, Cedric: First off, thank you for taking a look at this problem. Even though the REGEXP function appears to be quite valuable, I have little experience with it. After some additional reading last night, I varied my approach to solving this problem by implementing/testing the following:

data = regexp(LineTxt,'  +','split');
TrajFlag=cellstr(data{1,6});

Result:

'97.134234' '97.234234' '97.334234' 'State' 'Junk'

In executing this, I believe I found the source of the problem. Each row of data begins with a number (in this case numbers 1, 2, 9, 99, and 999). For values greater than or equal to 10, this problem doesn’t exist. I spoke with one of the developers this morning and verified that the values are between 1 and 999, but are always written with the fewest digits possible. So instead of 001, their code outputs 1. Instead of 010, it outputs 10. For all 3-digit values (100-999), the output is always in 3 digits.

So I get '97.134234' '97.234234' '97.334234' instead of ‘State’ ‘Post’ ‘RC’ when using the above commands.

If I use data = regexp(LineTxt,' +','split') (one space removed), I get the following result:

'97.134234' '97.234234' '97.334234' '97.334234' 'Junk'

If I use data = regexp(LineTxt,'   +','split') (3 spaces prior to the +), I get the following error:

Index exceeds matrix dimensions. Error in RSA(line 102) TrajFlag=cellstr(data{1,6});
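
The three results above are consistent with the row number being padded with leading spaces to a fixed width, so that a split on runs of spaces produces an extra empty first field for short numbers. That padding is an inference from the results, not something stated in the thread, and the sample rows below are hypothetical. A minimal sketch of one way to make the field index independent of the padding is to trim each line before splitting on any run of whitespace:

```matlab
% Hypothetical rows, with the first number right-aligned in 3 characters
rows = { '  1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B'
         ' 99  25.387466  156.362447  21578.388  97.334234  State  DDDDD  4  C33B'
         '999  25.387466  156.362447  21578.388  97.334234  Junk  EEEEE  5  C44B' };

% Without trimming, the two leading spaces create an empty first field,
% so the sixth field of the first row is a number rather than the flag
raw = regexp(rows{1}, '  +', 'split');
shifted = raw{6};

% Trimming first, then splitting on any whitespace run, keeps the flag
% in position 6 for every row
flags = cell(size(rows));
for k = 1:numel(rows)
    fields = regexp(strtrim(rows{k}), '\s+', 'split');
    flags{k} = fields{6};
end
```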

I’m just reading your comments now:

1. The file I read in has a repeating structure that looks like this:

01Jan2013 Live 2 1 01 0000000001 Low

1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B

21600.123 21612.122

3100.435 4100.380 5100.739

-0.491736 1.491492 0.891808

0.051748 -0.071254 0.021175

9 x 9 matrix of floats

01Jan2013 Live 2 1 01 0000000002 Low

2  25.287466  156.162447  21578.288  97.234234  Post  BBBBB  2  C11B

21602.223 21612.222

3200.435 4200.380 5200.739

-0.492736 1.492492 0.892180

0.052748 -0.072254 0.022175

9 x 9 matrix of floats

01Jan2013 Live 2 1 01 0000000003 Low

9  25.387466  156.362447  21578.388  97.334234  RC  CCCCC  3  C22B

21604.333 21612.333

3300.435 4300.380 5300.739

-0.493736 1.493492 0.893180

0.053748 -0.073254 0.023175

...

Knowing I receive a number of messages in a repeating pattern, I’ve implemented the following, where the line in question for each message is line number 2:

    for scr = 1:Num_MESs
        while ~feof(fid)
            LineNum = LineNum + 1;
            % ...
            % Read one line (row) of text from the input file at a time
            LineTxt = fgetl(fid);
            if LineNum == 2
                data = regexp(LineTxt,'   +','split');
                Flag = cellstr(data{1,6});
            end
            % ...
        end
        fclose(fid);
    end

2. I agree that my original approach was too generic to get the intended result – the words ‘State’ ‘Post’ ‘RC’ ‘State’ ‘Junk’. I’m working with the match string output in hopes of achieving the desired result.

Thanks for your inputs.

Answer by Cedric Wannaz on 30 Apr 2013
Edited by Cedric Wannaz on 30 Apr 2013

EDIT: if all words that you want to extract are known a priori, Per gave you the answer.

Your pattern is too generic; it splits the content every time a number is found, and with the 'split' option the first output argument of REGEXP contains the non-matching remainders, i.e. the text between the numbers.
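
To make this concrete, here is a small sketch using the first input row from the question. It shows that the sixth piece returned by 'split' is everything between 97.134234 and the next number, which is why both the flag and the following word appear. The final line is only one possible way to keep the first word; the variable names sixth and flag are illustrative.

```matlab
LineTxt = '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B';

% 'split' returns the text BETWEEN matches of the number pattern
pieces = regexp(LineTxt, '-?\d+(\.\d+)?', 'split');
sixth = pieces{6};   % text between 97.134234 and the next number

% One way to keep only the first word of that piece
flag = regexp(sixth, '[A-Za-z]+', 'match', 'once');
```

Note that cellstr strips trailing whitespace, which is why the output in the question shows leading spaces only.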

See my comment below your question, and if you provide me with more information, we can build a solution based on a clean match (if possible).

To give you an example, if the lines between the relevant lines that you mention contained only numeric values, you could match the keywords that you are interested in as follows:

 >> buffer = fileread('myFile.txt') ;
 >> match = regexp(buffer, '(?<=\d\s+)[a-zA-Z]*(?=\s)', 'match')
 match = 
    'State'    'Post'    'RC'    'State'    'Junk'

Note that we could use REGEXPI to make the pattern a little simpler. Here, we defined patterns to be matched as:

  • As many letters as possible (only the characters a to z, in lower and upper case, are allowed): [a-zA-Z]*
  • Preceded by (positive lookbehind (?<=...)): a numeric character \d followed by one or more whitespace characters \s+
  • Followed by (positive lookahead (?=...)): one whitespace character \s

For those who don't know regular expressions: the regexp engine scans buffer (which contains the file's content) for this pattern until it finds a match, saves the match in a cell array, resumes right after the match (it consumes the string, in a sense), and keeps matching and saving until the end of buffer. As the third argument asks for 'match', REGEXP outputs the cell array of matches.
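
The lookaround pattern above relies on the intermediate lines being purely numeric. The file structure described in the comments also contains a header line with words such as 'Live' and 'Low', which that pattern would pick up too. A hedged alternative, sketched below, anchors the match to the start of a line and requires a row number followed by four decimal numbers before capturing the flag. The buffer is a hypothetical excerpt, and the assumption that relevant rows always have exactly this shape is mine, not something confirmed in the thread.

```matlab
% Hypothetical buffer mimicking part of the file described in the comments
lines = { '01Jan2013 Live 2 1 01 0000000001 Low'
          '1  25.187466  156.162447  21578.188  97.134234  State  AAAAA  1  C00B'
          '21600.123 21612.122'
          '01Jan2013 Live 2 1 01 0000000002 Low'
          ' 99  25.387466  156.362447  21578.388  97.334234  Post  DDDDD  4  C33B' };
buffer = sprintf('%s\n', lines{:});

% Line start, row number, four decimal numbers, then capture the flag word
pattern = '^[ \t]*\d+(?:[ \t]+-?\d+\.\d+){4}[ \t]+([A-Za-z]+)';
tokens = regexp(buffer, pattern, 'tokens', 'lineanchors');
flags = [tokens{:}];   % flatten the nested 1x1 cells into one cell array
```

Using 'tokens' instead of lookarounds keeps the context check and the captured word in a single readable pattern, and the 'lineanchors' option makes ^ match at the start of every line rather than only at the start of buffer.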

1 Comment

Brad on 30 Apr 2013

Per, Cedric: Thanks for the push in the right direction. Code appears to be running like a champ!!
