Removing specific characters from string in nested cells

Question

I have a series of strings contained within a nested cell array (because regexp loves to nest cells), and I would like to remove every character that is not numeric or white space, namely asterisks, so that I can convert the strings to doubles.

I'm looking for the least painful way of removing these special characters from all of the strings. I do not have a sample file to attach, sorry, but I have described the shape of a sample array below.

X == 1x1 cell
X{1} == 1x1 cell (because regexp can't help itself apparently)
X{1}{1} = {'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '};

12 Comments

Bob Thompson on 13 Jun 2018

Stephen, it is related to the same file, but not the same part of the file. I believe I figured the other question out, but didn't think it was elegant enough to post as an answer to my own question.

I am unable to upload an actual sample document, but a sample of what I'm extracting from would be the following.

   1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

I am trying to capture the values of columns 1 and 3 from the table. I am specifically having trouble with column 3, which contains the asterisks; column 1 works fine with str2double.

col3{1} = regexp(input, '\<COLUMN3=\s*(.{1,400})1  \*\*\*\*','tokens');
col3{1} = regexp(col3{1}{1}, '\s+','split');

I am initializing the first level of the cell because I will have multiple tables. I used (.{1,400}) because I don't know how many values are in the table, and I cannot simply use (.*) because '1 **' occurs multiple times throughout the file. I don't think I can use \d or \w because of the ',' and '.' mixed in with the values. I used the second regexp to split the single string returned by the first, as I found that passing the split pieces to str2double was more consistent than applying str2double to the entire string.
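
For reference, a minimal sketch of a match-based alternative to the split step. It assumes each COLUMN entry sits on a single line, which may not hold for the real file, and the sample text and variable names (txt, tok, nums) are illustrative only. The number pattern never matches the trailing ',' or '*', so no cleanup pass is needed before str2double.

```matlab
% Sample text mirroring the excerpt above (layout assumed from the thread).
txt = sprintf(['   1  ****TABLE1****\n', ...
               '   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
               '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
               '   1  ****TABLE2****\n']);

% Grab the remainder of the COLUMN3 line; [^\n\r]* stops at the line break.
tok = regexp(txt, 'COLUMN3=([^\n\r]*)', 'tokens', 'once');

% Pull out only the numeric pieces; '*' and ',' are never part of a match.
nums = regexp(tok{1}, '\d+\.?\d*', 'match');

% Convert the cell array of number strings to a numeric row vector.
col3 = str2double(nums);
```

If a column can wrap onto several lines, the first pattern would need a different stopping condition, for example the next COLUMN label or the table marker.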

Bob Thompson on 14 Jun 2018

Edited: Bob Thompson on 14 Jun 2018

Hmmm. I'm currently using fileread and just importing the entire file as a single string. I've used fgetl in the past for other scripts, but due to the variability of this file I don't know if it's a good fit. Textscan might work, but I don't know that separating by each \n will work either, as it is possible that my various bits of data will be contained on multiple lines.

I've been working with it some again today, and I realized that my previous code works fine for the first column of values, as these never seem to have special characters. I can therefore get the number of values from this array and use that to build a repeating expression for the third column.

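col1 = regexp(input, '\<COLUMN1=\s*(.{1,400})1  \*\*\*\*','tokens');
col1 = regexp(col1{1}, '\s+','split');
colvals(:,1) = str2double(col1{1});
nvals = size(colvals,1); % number of values found in column 1
% Modified from Paolo's comment: one value, an optional '*', then the comma.
% The dot is escaped, and the trailing part is explicit so that it cannot
% swallow the leading digit of the next value.
dups = repmat('(\d+\.\d+)\*?,\s*',1,nvals);
expr = ['COLUMN3=\s+',dups]; % named expr to avoid shadowing the built-in string
col3 = regexp(input, expr, 'tokens');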

This seems to work, and removes the need to conduct the split a second time, which is nice.

I'm not really sure what the ':' from Paolo's comment is supposed to do, I don't see it anywhere in the regexp documentation, and it's not in any of my strings.

Also, OCDER and Paolo, I appreciate your help, so if one of you wants to write up an actual answer I would be happy to accept it.

Bob Thompson on 15 Jun 2018

Ah, I see. It doesn't appear in regexp.m comments, which is where I was looking.

Stephen23 on 15 Jun 2018

@Bob Nbob: you are right, it does not appear in the Mfile help. I notice that many other useful regular expression features also do not appear in the Mfile help: notably missing are dynamic expressions, lookaround operators, and named capture.

Both the inbuilt help and the page I linked to give a very useful introduction, and explain all features of regular expressions in MATLAB:

doc regexp
doc('Regular Expressions')
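
As a small illustration of one of the features mentioned above, here is a hedged sketch of named capture applied to the sample table. The token names (name, vals), the sample text, and the one-line-per-column layout are assumptions made for the example, not something taken from the original file.

```matlab
% Sample text mirroring the excerpt in this thread (layout assumed).
txt = sprintf(['   1  ****TABLE1****\n', ...
               '   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
               '   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
               '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
               '   1  ****TABLE2****\n']);

% Named tokens: 'name' holds the column label, 'vals' the rest of that line.
cols = regexp(txt, '(?<name>COLUMN\d+)=(?<vals>[^\n\r]*)', 'names');

% Convert each captured line to numbers and print it next to its label.
for k = 1:numel(cols)
    nums = str2double(regexp(cols(k).vals, '\d+\.?\d*', 'match'));
    fprintf('%s: %s\n', cols(k).name, mat2str(nums));
end
```

The 'names' output is a struct array with one element per match, so the column label travels with its values instead of relying on match order.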

Answer 1

Paolo on 15 Jun 2018

Edited: Paolo on 15 Jun 2018

Perhaps this can easily be achieved in two steps. For your input:

    1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

Step 1. Find and remove the punctuation that trails each value. In the sample this strips the "," and "*" after the digits while leaving the decimal points in place.

   data = fileread('CORR.txt');
   expression_sub = '(?<=\d\.\d*\*?)([\*\.,])';
   data_sub = regexprep(data,expression_sub,'');

data_sub no longer contains those characters. It is now:

     '   1  ****TABLE1****
        COLUMN1= 1.12 2.23 3.34 4.45 5.56 6.67
        COLUMN2= 0.00 0.00 0.00 0.00 0.00 0.00
        COLUMN3= 1.23 0.34 3.45 5.78 6.54 8.23
        1  ****TABLE2****
     '

Step 2. Match your data. The expression is greedy and will try to match as many digit, full stop, digits combinations as it can, so you don't need to repmat your expression as you showed.

 expression_match = '(?<=COLUMN[13]=\s)(\d.?\d*\s)*';
 [tokens,match] = regexp(data_sub,expression_match,'tokens','match');

MATLAB manipulation.

 column1 = str2double(strsplit(cell2mat(tokens{1}),' '));
 column3 = str2double(strsplit(cell2mat(tokens{2}),' '));

column1 =

1.1200 2.2300 3.3400 4.4500 5.5600 6.6700

column3 =

1.2300 0.3400 3.4500 5.7800 6.5400 8.2300

2 Comments

Bob Thompson on 18 Jun 2018

Ha, using (\d.?\d*\s)* is pretty slick. I'm a little sad I didn't think of that.

Stephen23 on 30 Dec 2022

@Bob Thompson: the dot needs to be escaped as well (otherwise it matches all characters), e.g.:

(\d+\.?\d*\s)*
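
Putting the escaped dot together with the two-step idea from the answer, a minimal sketch might look like the following. Two things differ from the answer and are assumptions of this sketch: step 1 uses a token replacement ('$1') instead of the lookbehind, and step 2 reads the 'match' output rather than 'tokens'. The sample text stands in for fileread('CORR.txt').

```matlab
% Stand-in for fileread('CORR.txt'); layout assumed from the thread.
data = sprintf(['   1  ****TABLE1****\n', ...
                '   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
                '   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
                '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
                '   1  ****TABLE2****\n']);

% Step 1 (swapped-in variant): keep each number, drop an optional '*' and the comma.
data_sub = regexprep(data, '(\d+\.\d*)\*?,', '$1');

% Step 2: escaped dot, and [13] so that only COLUMN1 and COLUMN3 are matched.
expression_match = '(?<=COLUMN[13]=\s)(\d+\.?\d*\s)*';
matches = regexp(data_sub, expression_match, 'match');

% Trim the trailing line break, split on whitespace, and convert to doubles.
column1 = str2double(strsplit(strtrim(matches{1})));
column3 = str2double(strsplit(strtrim(matches{2})));
```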

Answer 2

George Abrahams on 30 Dec 2022

The others are right to fix the root problem causing the tricky nested cell array. Having said that, for future reference, my deepreplace function on File Exchange / GitHub would have done exactly what you requested.

x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};
% Remove any character except for digits (0-9) and period (.)
match = regexpPattern('[^\d.]');
x = deepreplace(x,match,'');
% x = 1×1 cell array
%     {1×1 cell}
% x{1} = 1×1 cell array
%     {5×1 cell}
% x{1}{1} = 5×1 cell array
%     {'1234.'}
%     {'12.'  }
%     {'1234.'}
%     {'123.' }
%     {'321.' }
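
When the nesting depth is known in advance, as in the question, a sketch using only built-in functions is also possible. It relies on regexprep and str2double both accepting a cell array of character vectors; the variable names (inner, cleaned, vals) are illustrative, and unlike deepreplace it does not handle arbitrary nesting.

```matlab
x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};

% Unwrap the two outer levels to reach the 5-by-1 cell of character vectors.
inner = x{1}{1};

% Remove every character that is not a digit or a period, element by element.
cleaned = regexprep(inner, '[^\d.]', '');

% Convert the cleaned text to a 5-by-1 numeric column vector.
vals = str2double(cleaned);
```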
