Removing specific characters from string in nested cells

I have a series of strings which are contained within a nested cell array (because regexp loves to nest cells), and I would like to remove any characters that are not numeric or white space, namely asterisks, so that I can convert them to doubles.
I'm looking for the least painful way of removing these special characters from all of the strings. I do not have a sample file to attach, sorry, but I have described the shape of a sample array below.
X == 1x1 cell
X{1} == 1x1 cell (because regexp can't help itself apparently)
X{1}{1} = {'1234., ';'12.,* ';'1234., ';'123.,* ';' 321.,* '};

12 Comments

@Bob Nbob: is this related to your earlier question?
If so, it would probably be easier to fix the regular expression. Please upload a sample file that you want to get the data from.
Why not just
x = {'1234., ';'12.,* ';'1234., ';'123.,* ';' 321.,* '};
x = regexprep(x,'[^\d]','');
?
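One caveat with '[^\d]' is that it also strips the decimal point. That does not change the sample strings above, but it would change a value such as 5.78. A minimal sketch, assuming the goal is to keep digits and the decimal point (the variable names and the extra '5.78*, ' entry are only illustrative and not from the original data):

```matlab
% Sample strings from the question, plus one decimal value for illustration.
x = {'1234., '; '12.,* '; '1234., '; '123.,* '; ' 321.,* '; '5.78*, '};

digitsOnly = regexprep(x, '[^\d]', '');    % removes the '.' as well
keepPoint  = regexprep(x, '[^\d.]', '');   % keeps digits and '.'

valsDigitsOnly = str2double(digitsOnly);   % last entry is read as 578
valsKeepPoint  = str2double(keepPoint);    % last entry is read as 5.78
```

Inside a character class the '.' is literal, so it does not need escaping there.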
As mentioned by Stephen, it's probably easier to fix the regex used in your earlier question. I left a comment there.
Stephen, it is related to the same file, but not the same part of the file. I believe I figured the other question out, but didn't think it was elegant enough to post as an answer to my own question.
I am unable to upload an actual sample document, but a sample of what I'm extracting from would be the following.
1 ****TABLE1****
COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
1 ****TABLE2****
I am trying to capture the values of columns 1 and 3 from the table. I am specifically having trouble with column 3, which contains the asterisk, as column 1 works fine with str2double.
col3{1} = regexp(input, '\<COLUMN3=\s*(.{1,400})1 ****','tokens');
col3{1} = regexp(col3{1}{1}, '\s+','split');
I am initializing the first level of the cell because I will have multiple tables. I used (.{1,400}) because I don't know how many values are in the table, and I cannot simply use (.*) because '1 **' occurs multiple times throughout the file. I don't think I can use \d or \w because of the ',' and '.' mixed in with the values. I used the second regexp to split the single string that the first one returned, as I found this worked more consistently with str2double than applying str2double to the entire string.
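One possible alternative to the fixed-width (.{1,400}) is a lazy quantifier, which stops at the first following table header instead of the last. This is only a sketch under the assumption that every table header looks like '1 ****' as in the sample; the variable names txt, tok and col3vals are made up for illustration.

```matlab
% Rebuild the sample text from this comment as a single string,
% the way fileread would return it.
txt = sprintf(['1 ****TABLE1****\n', ...
    'COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
    'COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
    'COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
    '1 ****TABLE2****\n']);

% (.*?) is lazy: it expands only until the next '1 ****' header.
% The asterisks are escaped so they are treated as literal characters.
tok = regexp(txt, 'COLUMN3=\s*(.*?)\s*1 \*{4}', 'tokens', 'once');

% Pull out just the numbers; the asterisks and commas are skipped.
col3vals = str2double(regexp(tok{1}, '-?\d+\.?\d*', 'match'));
```

Matching the numbers directly in the second step means the asterisks never reach str2double, so no separate clean-up pass is needed.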
Not the prettiest but does the job, try this:
[tokens,matches]=regexp(yourtext,'(COLUMN[1,3]=\s*)(\d*.?\d*)(?:\,\s*)(\d*.\d*)(?:\,\s*)(\d*.\d*)(?:\,\s*)(\d*.\d*)(?:\*?\,\s*)(\d*.\d*)(?:\*?\,\s*)(\d*.\d*)(?:\*?\,\s)','tokens','match');
tokens{1}:
1×7 cell array
{'COLUMN1= '} {'1.12'} {'2.23'} {'3.34'} {'4.45'} {'5.56'} {'6.67'}
tokens{2}:
1×7 cell array
{'COLUMN3= '} {'1.23'} {'0.34'} {'3.45'} {'5.78'} {'6.54'} {'8.23'}
I tried re-using the group like this, but it only seems to work in PCRE.
Yeah, I used that grouping method with a different section by creating a string outside of the regexp and then putting the string variable into regexp.
grouping = repmat('(\d*.?\d*)(?:\,\s*)',1,number);
string = ['LEADER=\s*',grouping,'TAIL'];
result = regexp(text, string,'tokens');
It worked well for that section because I had a flag that I could get for how many times the values would repeat, but I don't have that for this part of the file.
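For reference, a small runnable version of that repmat approach might look like the following. The test string and the count of 3 are invented for illustration, LEADER and TAIL are the same placeholders as above, and the dot is escaped here so that it only matches a decimal point.

```matlab
number = 3;                                        % how many values repeat (assumed known)
grouping = repmat('(\d*\.?\d*),\s*', 1, number);   % one capture group per value
pattern = ['LEADER=\s*', grouping, 'TAIL'];

sample = 'LEADER= 1.5, 2.25, 3, TAIL';             % made-up test string
result = regexp(sample, pattern, 'tokens');        % one match holding 3 tokens
values = str2double(result{1});                    % numeric row vector
```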
Would something like this work?
Str = 'COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23, 2, -3., 24.*';
EqIdx = find(Str == '=', 1);
if ~isempty(EqIdx)
    Num = str2double(regexp(Str(EqIdx+1:end), '\-?\d+\.?\d*', 'match'));
end
Hmm, no, but this might be on the right track. The problem I have with it is that = is too generic of a value, so I end up with way too many results.
Might need more information of the start-to-end issue you're having. How are you reading in the text file? With fileread or fgetl or textscan? If you use fgetl or textscan, then you can get each row of text and then get the one you want. If you're using fileread, then it's much harder.
FID = fopen('textfile.txt');
TXT = textscan(FID, '%s', 'Delimiter', '\n');
TXT = TXT{1};
fclose(FID);
Num = cell(size(TXT));
for f = 1:length(TXT)
    Str = TXT{f};
    if contains(Str, 'CONTAINS=') % Specify condition for line you want here
        EqIdx = find(Str == '=', 1); % Example, you want values after "="???
        Num{f} = str2double(regexp(Str(EqIdx+1:end), '\-?\d+\.?\d*', 'match'));
    end
end
Hmmm. I'm currently using fileread and just importing the entire file as a single string. I've used fgetl in the past for other scripts, but due to the variability of this file I don't know if it's a good fit. Textscan might work, but I don't know that separating by each \n will work either, as it is possible that my various bits of data will be contained on multiple lines.
I've been working with it some more today, and I realized that my previous code works fine for the first column of values, as these never seem to have special characters. I can therefore get the number of values from this array, and use that to create a repeating string for the third column.
col1 = regexp(input, '\<COLUMN1=\s*(.{1,400})1 ****','tokens');
col1 = regexp(col1{1}, '\s+','split');
colvals(:,1) = str2double(col1{1});
nvals = length(colvals);
dups = repmat('(\d*.\d*).{1,3}\s*',1,nvals); % Modified from Paolo's comment
string = ['COLUMN3=\s+',dups];
col3 = regexp(input, string, 'tokens');
This seems to work, and removes the need to conduct the split a second time, which is nice.
I'm not really sure what the ':' from Paolo's comment is supposed to do, I don't see it anywhere in the regexp documentation, and it's not in any of my strings.
Also, OCDER and Paolo, I appreciate your help, so if one of you wants to write up an actual answer I would be happy to accept it.
"I'm not really sure what the ':' from Paolo's comment is supposed to do, I don't see it anywhere in the regexp documentation..."
Open the documentation, then use ctrl+f to search the webpage for ?:
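As a quick illustration of what (?:...) does: it groups part of the expression without creating a token, so the grouped text is matched but not returned by 'tokens'. A minimal sketch on a made-up string:

```matlab
str = 'ab12cd34';   % made-up example string

% Two capturing groups: each match returns two tokens.
withGroup = regexp(str, '([a-z]+)(\d+)', 'tokens');

% The letters are still required for a match, but (?:...) does not
% capture them, so each match returns only the digits.
withoutGroup = regexp(str, '(?:[a-z]+)(\d+)', 'tokens');
```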
Ah, I see. It doesn't appear in regexp.m comments, which is where I was looking.
@Bob Nbob: you are right, it does not appear in the Mfile help. I notice that many other useful regular expression features also do not appear in the Mfile help: notably missing are dynamic expressions, lookaround operators, and named capture.
Both the inbuilt help and the page I linked to give a very useful introduction, and explain all features of regular expressions in MATLAB:
doc regexp
doc('Regular Expressions')


 Accepted Answer

Perhaps this can easily be achieved in two steps. For your input:
1 ****TABLE1****
COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
1 ****TABLE2****
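Step 1. Find and remove the punctuation characters that trail each number (the character class covers ",", "." and "*"). The lookbehind restricts the match to characters that follow a decimal number, so the decimal point inside each value is kept. Live regex here.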
data = fileread('CORR.txt');
expression_sub = '(?<=\d\.\d*\*?)([\*\.,])';
data = regexprep(data,expression_sub,'');
The variable data no longer contains those trailing characters. It is now:
' 1 ****TABLE1****
COLUMN1= 1.12 2.23 3.34 4.45 5.56 6.67
COLUMN2= 0.00 0.00 0.00 0.00 0.00 0.00
COLUMN3= 1.23 0.34 3.45 5.78 6.54 8.23
1 ****TABLE2****
'
Step 2. Match your data. Live regex here. The expression is greedy and will try to match as many digit, full stop, digits combinations as it can. Therefore you don't need to repmat your expression like you showed.
expression_match = '(?<=COLUMN[1,3]=\s)(\d.?\d*\s)*';
[tokens,match] = regexp(data,expression_match,'tokens','match');
Step 3. Convert the matched text to numbers in MATLAB.
column1 = str2double(strsplit(cell2mat(tokens{1}),' '));
column3 = str2double(strsplit(cell2mat(tokens{2}),' '));
column1 =
1.1200 2.2300 3.3400 4.4500 5.5600 6.6700
column3 =
1.2300 0.3400 3.4500 5.7800 6.5400 8.2300

2 Comments

Ha, using (\d.?\d*\s)* is pretty slick. I'm a little sad I didn't think of that.
@Bob Thompson: the dot needs to be escaped as well (otherwise it matches all characters), e.g.:
(\d+\.?\d*\s)*
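Putting the answer and the escaped-dot fix together, a self-contained variant could look like the sketch below. It is not the answer's exact code: the sample text is built inline instead of being read with fileread, the lookbehind is swapped for a token replacement, the repeated part is wrapped in one outer capture group, and [ \t] is used between values so the match stops at the end of the line rather than running into the '1 ****TABLE2****' header. All of those choices are assumptions made for this example.

```matlab
% Sample text from the question, built inline for a runnable example.
data = sprintf(['1 ****TABLE1****\n', ...
    'COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
    'COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
    'COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
    '1 ****TABLE2****\n']);

% Step 1: drop an optional asterisk plus the comma that follow a digit.
data = regexprep(data, '(\d)\*?,', '$1');

% Step 2: capture the whole run of numbers after COLUMN1= or COLUMN3=.
% The dot is escaped, and (?:...) repeats without creating extra tokens.
tok = regexp(data, 'COLUMN[13]=\s*((?:\d+\.?\d*[ \t]*)+)', 'tokens');

% Step 3: split each captured run on whitespace and convert to doubles.
column1 = str2double(strsplit(strtrim(tok{1}{1})));
column3 = str2double(strsplit(strtrim(tok{2}{1})));
```

The strtrim call guards against a trailing blank producing an empty piece, which str2double would turn into NaN.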


More Answers (1)

The others are right to fix the root problem causing the tricky nested cell array. Having said that, for future reference, my deepreplace function on File Exchange / GitHub would have done exactly what you requested.
x = {{{'1234., ';'12.,* ';'1234., ';'123.,* ';' 321.,* '}}};
% Remove any character except for digits (0-9) and period (.)
match = regexpPattern('[^\d.]');
x = deepreplace(x,match,'');
% x = 1×1 cell array
% {1×1 cell}
% x{1} = 1×1 cell array
% {5×1 cell}
% x{1}{1} = 5×1 cell array
% {'1234.'}
% {'12.' }
% {'1234.'}
% {'123.' }
% {'321.' }

Asked:

on 13 Jun 2018

Commented:

on 30 Dec 2022
