Why is regexp including extra data?
I’m trying to use REGEXP to extract the flag on each line (State, Post, RC, State, Junk) and then create a cell array of strings.
My inputs are:
1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B
2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B
9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B
99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B
999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B
I’m using the following MATLAB commands:
data = regexp(LineTxt,'-?\d+(\.\d+)?','split');
Flag=cellstr(data{1,6});
For unknown reasons I keep getting the following output:
' State AAAAA' ' Post BBBBB' ' RC CCCCC' ' State DDDDD' ' Junk EEEEE'
Intended output is:
' State ' ' Post ' ' RC ' ' State ' ' Junk'
Why are the extra fields being included?
1 Comment
Cedric
on 29 Apr 2013
Edited: Cedric
on 29 Apr 2013
So here, do what you call 'State', 'Post', etc. indicate places where what you want to get lies (e.g. Alabama, Michigan, etc.), or do you really have to get the words 'State', 'Post', etc.?
EDIT: reading your comment below, you have other lines with a different structure in between these lines. What do they look like? In other words, what can we use to make the match specific to the relevant lines?
Accepted Answer
per isakson
on 29 Apr 2013
Edited: per isakson
on 30 Apr 2013
Neither do I understand why. Regular expressions are tricky. Another approach:
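% loop over all rows; rows is assumed to be a cell array with one line of text per cell
for k = 1 : numel( rows )
    data{k} = regexp( rows{k}, '((State)|(Post)|(RC)|(State)|(Junk))', 'match', 'once' );
end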
[Edit] IMO, given that the file was written with a similar format string, textscan is the best way to read the file.
fid = fopen( 'cssm.txt' );
cac = textscan( fid, '%*u%*f%*f%*f%*f%s%*s%*d%*s' );
fclose( fid );
cac{:}
returns
ans =
'State'
'Post'
'RC'
'State'
'Junk'
Answer to comment:
str = fileread( 'cssm.txt' );
data = regexp( str, '((State)|(Post)|(RC)|(State)|(Junk))', 'match' )
returns
data =
'State' 'Post' 'RC' 'State' 'Junk'
where cssm.txt contains your five lines of data
However, the smarter the solution, the harder it is to make the code robust (and flexible). How should the code behave if there are rows in the file that do not adhere to the "format" that I infer from your example? And in a few days, you might find that you need the third column.
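As a rough sketch of the flexibility point above, textscan can also keep every column instead of skipping them, so the third column is available later. This assumes the five lines from the question are the whole input and passes them to textscan as text rather than through a file handle; the variable names rowsTxt, txt, flags and third are only illustrative.

```matlab
% The five lines from the question, held in a cell array (assumed input).
rowsTxt = { ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B', ...
    '9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B', ...
    '99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B', ...
    '999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B'};

% Join the lines with newlines, as they would appear in the file.
txt = sprintf('%s\n', rowsTxt{:});

% Same layout as the format string above, but without the skip marker (*),
% so every column is returned.
cac = textscan(txt, '%u %f %f %f %f %s %s %d %s');

flags = cac{6};   % cell array of the flag strings
third = cac{3};   % numeric third column
```

Rows that do not follow this layout would still need separate handling, which is the robustness question raised above.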
More Answers (1)
Cedric
on 30 Apr 2013
Edited: Cedric
on 30 Apr 2013
EDIT: if all words that you want to extract are known a priori, Per gave you the answer.
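Your pattern is too generic: it splits the content each time a number is found, and with the 'split' option the first output argument of REGEXP is the non-matching remainder, i.e. the text between the numbers. The sixth piece is everything between the fifth number and the next one, which is why it contains both the flag and the word that follows it (e.g. ' State AAAAA').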
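To see what the 'split' option returns for one line from the question, here is a minimal sketch. The names pieces, nums, sixth and flag are only illustrative, and using strtok to keep the first word is one possible fix, not necessarily the best one for your real file.

```matlab
% One line from the question.
LineTxt = '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B';

% Ask for both the text between the matches and the matches themselves;
% the outputs come back in the order in which the options are listed.
[pieces, nums] = regexp(LineTxt, '-?\d+(\.\d+)?', 'split', 'match');

% pieces{6} is everything between the 5th and the 6th number, so it holds
% the flag and the following word together.
sixth = pieces{6};

% Keeping only the first whitespace-delimited word isolates the flag.
flag = strtok(sixth);
```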
See my comment below your question, and if you provide me with more information, we can build a solution based on a clean match (if possible).
To give you an example, if the lines between the relevant lines that you mention contained only numeric values, you could match the keywords that you are interested in as follows:
>> buffer = fileread('myFile.txt') ;
>> match = regexp(buffer, '(?<=\d\s+)[a-zA-Z]*(?=\s)', 'match')
match =
'State' 'Post' 'RC' 'State' 'Junk'
Note that we could use REGEXPI to make the pattern a little simpler. Here, we defined patterns to be matched as:
- As many letters as possible (only chars. a to z in lower and upper cases are allowed): [a-zA-Z]*
- Preceded by (positive lookbehind (?<=...)): a numeric char. \d followed by one or more whitespace characters \s+
- Followed by (positive lookahead (?=...)): one whitespace character \s
For those who don't know regular expressions: the regexp engine looks for this pattern in buffer (which contains the file's content) until it finds a match, saves this match in a cell array, restarts after the match (it eats the string in some sense), and goes on matching and saving until the end of buffer. As the 3rd arg. asks for 'match', REGEXP outputs the cell array of matches.
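The matching described above can be tried without a file by building the buffer by hand. This is only a sketch: the cell array rowsTxt stands in for the file content that fileread would return, and the REGEXPI line shows the slightly shorter, case-insensitive variant mentioned earlier.

```matlab
% The five lines from the question (assumed to be the whole file).
rowsTxt = { ...
    '1 25.187466 156.162447 21578.188 97.134234 State AAAAA 1 C00B', ...
    '2 25.287466 156.162447 21578.288 97.234234 Post BBBBB 2 C11B', ...
    '9 25.387466 156.362447 21578.388 97.334234 RC CCCCC 3 C22B', ...
    '99 25.387466 156.362447 21578.388 97.334234 State DDDDD 4 C33B', ...
    '999 25.387466 156.362447 21578.388 97.334234 Junk EEEEE 5 C44B'};

% Join the lines with newlines to mimic the output of fileread.
buffer = sprintf('%s\n', rowsTxt{:});

% Letters that follow "digit + whitespace" and are followed by whitespace.
match = regexp(buffer, '(?<=\d\s+)[a-zA-Z]*(?=\s)', 'match');

% Case-insensitive variant: the character class only needs a-z.
matchI = regexpi(buffer, '(?<=\d\s+)[a-z]*(?=\s)', 'match');
```

The second word on each line (AAAAA, BBBBB, ...) is not returned because it is preceded by a letter, not by a digit.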