Why is regexp including extra data?

I’m trying to use REGEXP to match the following flags (State, Post, RC, State, Junk), and then create a cell array of strings.
My inputs are:
1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B
2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B
9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B
99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B
999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B
I’m using the following MATLAB commands:
data = regexp(LineTxt,'-?\d+(\.\d+)?','split');
Flag=cellstr(data{1,6});
For unknown reasons I keep getting the following output:
' State AAAAA' ' Post BBBBB' ' RC CCCCC' ' State DDDDD' ' Junk EEEEE'
Intended output is:
' State ' ' Post ' ' RC ' ' State ' ' Junk'
Why are the extra fields being included?
  1 Comment
Cedric on 29 Apr 2013
Edited: Cedric on 29 Apr 2013
So here, do what you call 'State', 'Post', etc. indicate places where what you want to get lies (e.g. Alabama, Michigan, etc.), or do you really have to get the words 'State', 'Post', etc.?
EDIT: reading your comment below, you have other lines with a different structure in between these lines. What do they look like? In other words, what can we use to make the match specific to these relevant lines?


Accepted Answer

per isakson on 29 Apr 2013
Edited: per isakson on 30 Apr 2013
Neither do I understand why. Regular expressions are tricky. Another approach:
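% loop over all rows; "rows" is a placeholder for a cell array holding one line of text per cell
for k = 1 : numel( rows )
    data{k} = regexp( rows{k}, '((State)|(Post)|(RC)|(State)|(Junk))', 'match','once' );
end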
[Edit] IMO, given that the file is written with a similar format string, textscan is the best way to read the file.
fid = fopen( 'cssm.txt' );
cac = textscan( fid, '%*u%*f%*f%*f%*f%s%*s%*d%*s' );
fclose( fid );
cac{:}
returns
ans =
'State'
'Post'
'RC'
'State'
'Junk'
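The format string can also be tried without a file, since textscan accepts character data directly. A small sketch using two of the sample lines; the variable names txt and flags are illustrative only:

```matlab
% Two sample rows joined with a newline, standing in for the file content.
txt = sprintf('%s\n%s', ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B');

% Fields marked with * are read and discarded; only the sixth (%s) is kept.
cac = textscan(txt, '%*u%*f%*f%*f%*f%s%*s%*d%*s');
flags = cac{1};   % column cell array holding the flag of each row
disp(flags)
```

Removing the * from one of the skipped fields would keep that column as an additional cell in cac.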
Answer to comment:
str = fileread( 'cssm.txt' );
data = regexp( str, '((State)|(Post)|(RC)|(State)|(Junk))', 'match' )
returns
data =
'State' 'Post' 'RC' 'State' 'Junk'
where cssm.txt contains your five lines of data
However, the smarter the solution, the harder it is to make the code robust (and flexible). How should the code behave if there are rows in the file that do not adhere to the "format" that I infer from your example? And in a few days, you might find that you need the third column.
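One possible way to deal with both concerns is to anchor the pattern to the whole layout of a data row, skip every row that does not match, and capture the third column along with the flag. This is only a sketch: it assumes a data row starts with an integer, followed by four blank-separated fields and then one of the four known flags, and the names rows, pat, flags and col3 are made up for the illustration.

```matlab
% A few lines in the layout shown in this thread (header, data row, other rows).
rows = { ...
    '01Jan2013 Live 2 1 01 0000000001 Low', ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '21600.123 21612.122', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B'};

% Integer, field, (third column), field, field, (flag), white space.
pat = '^\s*\d+\s+\S+\s+(\S+)\s+\S+\s+\S+\s+(State|Post|RC|Junk)\s';

flags = {};
col3  = [];
for k = 1:numel(rows)
    tok = regexp(rows{k}, pat, 'tokens', 'once');
    if isempty(tok)
        continue   % row does not adhere to the assumed format: skip it
    end
    col3(end+1)  = str2double(tok{1}); %#ok<AGROW>
    flags{end+1} = tok{2};             %#ok<AGROW>
end
disp(flags)
```

Rows that do not match are silently skipped here; depending on the application, issuing a warning for them might be the safer choice.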
  4 Comments
Cedric on 30 Apr 2013
Edited: Cedric on 30 Apr 2013
I'm voting for Per's edit if all you need is the sequence of these words and they are known a priori.
Brad on 30 Apr 2013
Per, Cedric: First off, thank you for taking a look at this problem. Even though the REGEXP function appears to be quite valuable, I have little experience with it. After some additional reading last night, I varied my approach to solving this problem by implementing/testing the following:
data = regexp(LineTxt,'  +','split');
TrajFlag=cellstr(data{1,6});
Result:
'97.134234' '97.234234' '97.334234' 'State' 'Junk'
In executing this, I believe I found the source of the problem. Each row of data begins with a number (in this case numbers 1, 2, 9, 99, and 999). For values greater than or equal to 10, this problem doesn’t exist. I spoke with one of the developers this morning and verified that the values are between 1 and 999 – but are always in the least digits format. So instead of 001, their code outputs 1. Instead of 010, it outputs 10. For all 3-digit values (100-999), the output is always in 3 digits.
So I get '97.134234' '97.234234' '97.334234' instead of ‘State’ ‘Post’ ‘RC’ when using the above commands.
If data = regexp(LineTxt,' +','split') (one space removed), I get the following result:
'97.134234' '97.234234' '97.334234' '97.334234' 'Junk'
If data = regexp(LineTxt,'   +','split') (3 spaces prior to the +), I get the following error:
Index exceeds matrix dimensions.
Error in RSA (line 102)
TrajFlag=cellstr(data{1,6});
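One possible way to make the field index independent of leading blanks is to trim each line before splitting on white space. A minimal sketch, assuming the first number is right-aligned in a three-character field; the exact spacing of the file is not shown in this thread, so the sample lines and the names sample and flags are assumptions:

```matlab
% Sample lines with assumed leading blanks (right-aligned first field).
sample = { ...
    '  1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    ' 99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B', ...
    '999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B'};

flags = cell(size(sample));
for k = 1:numel(sample)
    % strtrim removes leading/trailing white space, so the split never
    % starts with an empty field and the flag is always field 6.
    parts = regexp(strtrim(sample{k}), '\s+', 'split');
    flags{k} = parts{6};
end
disp(flags)
```

Without strtrim, a line that starts with blanks yields an empty first field, which shifts every following index by one.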
I’m just reading your comments now:
1. The file I read in has a repeating structure that looks like this:
01Jan2013 Live 2 1 01 0000000001 Low
1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B
21600.123 21612.122
3100.435 4100.380 5100.739
-0.491736 1.491492 0.891808
0.051748 -0.071254 0.021175
9 x 9 matrix of floats
01Jan2013 Live 2 1 01 0000000002 Low
2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B
21602.223 21612.222
3200.435 4200.380 5200.739
-0.492736 1.492492 0.892180
0.052748 -0.072254 0.022175
9 x 9 matrix of floats
01Jan2013 Live 2 1 01 0000000003 Low
9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B
21604.333 21612.333
3300.435 4300.380 5300.739
-0.493736 1.493492 0.893180
0.053748 -0.073254 0.023175
.
.
.
.
.
.
.
.
Knowing I receive a number of messages in a repeating pattern, I’ve implemented the following, where the line in question for each message is line number 2:
for scr = 1:Num_MESs
while ~feof(fid)
LineNum = LineNum + 1;
.
.
.
.
.
% Read one line (row) of text from the input file at a time
LineTxt = fgetl(fid);
if LineNum == 2
data = regexp(LineTxt,' +','split');
Flag=cellstr(data{1,6});
end
.
.
.
.
.
end
fclose(fid);
end
2. I agree that my original approach was too generic to get the intended result – the words ‘State’ ‘Post’ ‘RC’ ‘State’ ‘Junk’. I’m working with the match string output in hopes of achieving the desired result.
Thanks for your inputs.


More Answers (1)

Cedric on 30 Apr 2013
Edited: Cedric on 30 Apr 2013
EDIT: if all words that you want to extract are known a priori, Per gave you the answer.
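Your pattern is too generic; it splits the content each time a number is found, and with the 'split' option what REGEXP returns is the text between the matches, i.e. the non-matching remainder. Since there is no number between 'State' and 'AAAAA', both words end up in the same piece.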
See my comment below your question, and if you provide me with more information, we can build a solution based on a clean match (if possible).
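To see where the extra text comes from, it can help to print every piece that 'split' returns for one line. A small illustration using the first input line from the question; the variable name parts is arbitrary:

```matlab
LineTxt = '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B';

% 'split' returns the text BETWEEN matches of the number pattern.
parts = regexp(LineTxt, '-?\d+(\.\d+)?', 'split');

% Show each piece between quotes so that blanks are visible.
for k = 1:numel(parts)
    fprintf('%d: "%s"\n', k, parts{k});
end
% parts{6} is ' State AAAAA ': no number separates 'State' from 'AAAAA',
% so both words land in the same piece.
```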
To give you an example, if lines between these relevant lines that you mention were containing only numeric values, you could match the keywords that you are interested in as follows:
>> buffer = fileread('myFile.txt') ;
>> match = regexp(buffer, '(?<=\d\s+)[a-zA-Z]*(?=\s)', 'match')
match =
'State' 'Post' 'RC' 'State' 'Junk'
Note that we could use REGEXPI to make the pattern a little simpler. Here, we defined patterns to be matched as:
  • As many letters as possible (only chars. a to z in lower and upper cases are allowed): [a-zA-Z]*
  • Preceded by (positive lookbehind (?<=)): a numeric char. \d followed by one or more white spaces \s+
  • Followed by (positive lookahead (?=)): one white space \s
.. and for those who don't know regular expressions, the regexp engine looks for this pattern in buffer (which contains the file's content) until it finds a match, saves this match in a cell array, restarts after the match (it eats the string in some sense) and goes on iteratively matching/saving until the end of buffer. As the 3rd arg. asks for 'match', REGEXP outputs the cell array of matches.
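The same idea can also be written with a capture group and the 'tokens' option instead of lookarounds. This is a variation on the pattern above, not the original code: it requires at least one letter ([a-zA-Z]+) and assumes that buffer holds only the five sample lines from the question.

```matlab
% The five sample lines from the question, each terminated by a newline.
buffer = sprintf('%s\n', ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B', ...
    '9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B', ...
    '99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B', ...
    '999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B');

% A digit, white space, the captured word, then one white space.
tok = regexp(buffer, '\d\s+([a-zA-Z]+)\s', 'tokens');

% Each match yields a 1x1 cell; concatenate them into one cell array.
match = [tok{:}];
disp(match)
```

Unlike the lookaround version, the digit and the surrounding white space are consumed by each match here, which makes no difference for this layout because the keywords are never adjacent.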
  1 Comment
Brad on 30 Apr 2013
Per, Cedric: Thanks for the push in the right direction. Code appears to be running like a champ!!
