Thread Subject:
manipulating strings

Subject: manipulating strings

From: arun

Date: 7 Jul, 2009 11:29:50

Message: 1 of 16

Hi,

suppose I have a string A whose size is 1*10^7. I would now like to
remove certain characters from the string. I tried strfind and regexprep
as follows:

A(strfind(A, ',')) = ''; % delete every comma
and then I repeat this for all digits from 0 to 9 and for the space character.

A more efficient alternative, I hoped, would be
A = regexprep(A, '[0-9, ]', '');
but the first approach takes forever because the vector is long, and the
second one strangely gives me an "out of memory" error...


any ways to speed up?

thank you very much,
arun.

Subject: manipulating strings

From: nor ki

Date: 7 Jul, 2009 12:28:01

Message: 2 of 16


Hi Arun,
as you only look for single characters, you could build a lookup table of type logical, indexed by character code, that contains true for each character you want to keep and false for each character that should be removed.
Call it lut.

Then build a logical mask marking the positions of the desired characters:

idx = lut(A);

and use it to pull them out of A:

A = A(idx);

or in short:

A = A(lut(A));
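
To make the lookup table concrete, here is a minimal sketch. The sample string, the name `keep` and the 256-entry table size are assumptions for illustration; it also assumes every character code in A lies between 1 and 256, so that the code can serve directly as an index.

```matlab
% Made-up sample input and the set of characters to keep (both assumptions).
A = '12,A,C, 7,G,T';
keep = 'ACGT';

% One logical entry per character code, true only for the codes to keep.
lut = false(1, 256);
lut(double(keep)) = true;

% Indexing lut with the char vector uses each character code as a subscript,
% so lut(A) is a logical mask with the same length as A.
A = A(lut(A));
disp(A)
```

The table is built once from a handful of characters, so the cost per character of the long string is a single indexing operation.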

hth
kinor

Subject: manipulating strings

From: arun

Date: 7 Jul, 2009 17:30:14

Message: 3 of 16


Hi Kinor,

Thank you for the suggestion. I just have some trouble understanding
how to construct this lut. Is it like a map? I have to know which
character gets a true and which character gets a false...

Suppose A = '1,1600,A,G,G,G,A,A,A,G,A,A,G';

Here I don't need the commas or the numbers 1 and 1600; that is,
the desired string is A = 'AGGGAAAGAAG'.
If I don't have a map, then my lookup table has to contain a value
for every possible character, right? I am not sure that is what you
suggested... I mean,

lut = [0,0,0,0,0,0,0,1,0,1,0,1...] and then use A = lut(A)...
Is this what you suggested?
thank you very much,
arun.

Subject: manipulating strings

From: arun

Date: 7 Jul, 2009 17:48:57

Message: 4 of 16


Hi,

I tried it like this...

lut = 'AGCT';
%str is a 1*100million string.

str = str(ismember(str,lut));

This seems to work pretty fast for 10^7 characters, but not for 10^8 or
10^9, where it gives an out of memory error. But I guess it should still
be pretty fast if I process the string in a for loop, taking 10^7 entries
at a time...

Thank you... I would appreciate it if someone could let me know of
better methods available.

thanks,
arun.

Subject: manipulating strings

From: Rune Allnor

Date: 7 Jul, 2009 18:27:54

Message: 5 of 16

On 7 Jul, 13:29, arun <aragorn1...@gmail.com> wrote:
> Hi,
>
> suppose I have a string A whose size is 1*10^7. I would now like to
> remove certain characters in the string. I tried strfind and regexprep
> as follows
>
> A(strfind(A, ',')) = ''; %replace entries with a comma with nothing
> and then i repeat this for all numbers from 0 to 9 and for "space".
>
> Alternative efficient way i hoped would be,
> A = regexprep(A, "[0-9, ]", '');
> but the first expression takes for ever as the vector is long

Every time this finds a match, the vector is shortened by one
character, meaning you need to allocate space and reshuffle the
contents every time one finds a hit. This would probably be
better:

rexp = '[^0-9, ]';
idx = regexp(A,rexp); % start index of every character to keep
A = A(idx);
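
Another way to avoid the repeated shrinking, sketched here with plain comparisons instead of a regular expression: build one logical mask for everything that should go, then delete in a single step. The sample string and the name `drop` are made up for illustration.

```matlab
A = '1,1600,A,G,G,G,A';   % made-up sample input

% True wherever A holds a comma, a space, or a digit.
drop = (A == ',') | (A == ' ') | (A >= '0' & A <= '9');

% A single deletion, so the remaining characters are moved only once.
A(drop) = [];
disp(A)
```

The comparisons each create a temporary logical array as long as A, so on a memory-starved machine this may still need to be applied block by block.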

Rune

Subject: manipulating strings

From: nor ki

Date: 8 Jul, 2009 07:07:01

Message: 6 of 16

Hi Arun,

str = str(ismember(str,lut));
applies ismember to the whole variable str.

Do it like this instead:

lut = ~ismember(1:256,removechars);
str = str(lut(str));

where removechars holds the characters you want removed. Here ismember only runs over the 256 table entries, and the long string is handled by plain indexing.

For 10^8 you really have to use a loop; maybe Rune's idea works better then...
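
A minimal sketch of such a loop follows. The sample data, the block size and the names `blk`, `part` and `n` are assumptions for illustration; for real data a block of something like 1e7 characters would be more sensible. Each pass filters one block with the lookup table and writes the kept characters back into the front of `str`, so no second full-size buffer is needed.

```matlab
% Illustrative data; in practice str would be the 10^8-character vector.
str = repmat('1,A, 2,G,', 1, 1000);
removechars = ['0':'9', ', '];
lut = ~ismember(1:256, removechars);

blk = 2500;                 % assumed block size, small here so the loop runs several times
total = numel(str);
n = 0;                      % number of characters kept so far
for first = 1:blk:total
    last = min(first + blk - 1, total);
    part = str(first:last);
    part = part(lut(part));             % keep only the wanted characters
    str(n+1:n+numel(part)) = part;      % safe: n never exceeds first-1
    n = n + numel(part);
end
str = str(1:n);             % trim the unused tail
```

Because the filter works character by character, the block boundaries do not affect the result.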

kinor

Subject: manipulating strings

From: us

Date: 8 Jul, 2009 07:49:01

Message: 7 of 16


one of the solutions
- use ISMEMBC rather than ISMEMBER

     clear ix v; % <- save old stuff
     tmpl='0':'z';
     v=repmat(tmpl(randperm(numel(tmpl))),1,600000);
     size(v,2)
% ans = 45,000,000
     tmpl=sort(['0':'9',',']); % <- must be SORTed!
tic;
     ix=ismembc(v,tmpl);
toc
%{
Elapsed time is 0.558719 seconds.
% wintel system: ic2/2*2.6ghz/2mb/winxp.sp3/r2009a
%}
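
The timing above only builds the mask `ix`; the removal itself is one more indexing step. A small sketch on a made-up sample, using the documented ISMEMBER (ISMEMBC should be a drop-in replacement here as long as the second argument is sorted, but since it is an internal helper that is an assumption and is left as a comment):

```matlab
v = 'a1,b22,c';                   % tiny made-up sample
tmpl = sort(['0':'9', ',']);      % characters to remove, sorted
ix = ismember(v, tmpl);           % or: ix = ismembc(v, tmpl);  (internal helper)
v = v(~ix);                       % keep everything that is not in tmpl
disp(v)
```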

us

Subject: manipulating strings

From: nor ki

Date: 8 Jul, 2009 09:06:01

Message: 8 of 16

"us " <us@neurol.unizh.ch> wrote in message <h31j1d$b16$1@fred.mathworks.com>...

Hi Us,

where did you find ismembc? is there a place to find undocumented functions?

kinor

tmpl='0':'z';
strvar=repmat(tmpl(randperm(numel(tmpl))),1,1e6);
removechars=sort(['0':'9',',']); % <- must be SORTed!

tic
    lut1 = ~ismember(1:256,removechars);
    res1 = strvar(lut1(strvar));
toc

tic
    lut2 = ~ismembc(strvar, removechars);
    res2 = strvar(lut2);
toc

isequal(res1, res2)

Elapsed time is 1.523525 seconds.
Elapsed time is 1.862163 seconds.

ans =

     1

Subject: manipulating strings

From: us

Date: 8 Jul, 2009 09:21:01

Message: 9 of 16

"nor ki"
> where did you find ismembc? is there a place to find undocumented functions?

it is not an undocumented function...
rather, look at ISMEMBER

     edit ismember;
% and you'll find this at line #121 - 127
%{
      % Two C-Helper Functions are used in the code below:
      
      % ISMEMBC - S must be sorted - Returns logical vector indicating which
      % elements of A occur in S
      % ISMEMBC2 - S must be sorted - Returns a vector of the locations of
      % the elements of A occurring in S. If multiple instances occur,
      % the last occurrence is returned
%}
% then, being an investigative person, you'll immediately do this
     which ismembc;
% MLROOT\toolbox\matlab\ops\ismembc.mexw32 % <- a MEX...
% and play with it in the command window (timing and so on)

it's often worthwhile to look at ML stock functions to
- see how TMW does things (not always optimized...)
- look for hidden gems...

us

Subject: manipulating strings

From: nor ki

Date: 8 Jul, 2009 09:29:02

Message: 10 of 16

"us " <us@neurol.unizh.ch> wrote in message <h31odt$6aq$1@fred.mathworks.com>...

Hi US,

thank you for the hint

kinor

Subject: manipulating strings

From: arun

Date: 8 Jul, 2009 10:38:09

Message: 11 of 16


nor ki,

thank you for your suggestions. They work very well. Now my next
formidable task is to reshape this vector into a 192*240605 matrix. (My
actual task is to parse a 270 MB file line by line and
do these operations. But I found another topic in which Uwe has shown
the fastest way to read a whole file into a variable using fread, and
now I am trying to remove the unwanted entries and then shape them
into the desired matrix. The old line-by-line method takes about 30-45
minutes on this old computer. So far, everything before the reshape step
runs without an out of memory error and takes 1.5 minutes. Let me see!)
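
For reference, the whole pipeline described above could look roughly like the sketch below. The file name, its contents and the record length of 4 are made up so the example runs on its own; the real data would use 192 rows. Whether each record should end up as a column or a row is also an assumption here.

```matlab
% Write a small made-up file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1,1600,A,G,G,T\n2,1601,C,C,A,T\n');
fclose(fid);

% Read the whole file at once as a char row vector.
fid = fopen(fname, 'r');
raw = fread(fid, inf, '*char')';
fclose(fid);
delete(fname);

% Drop digits, commas, spaces and line breaks via a lookup table.
removechars = ['0':'9', ', ', char([10 13])];
lut = ~ismember(1:256, removechars);
seq = raw(lut(raw));

% reshape fills column by column, so each column holds one record of 4 letters.
M = reshape(seq, 4, []);
```

If the records should be rows instead, transposing the result (`M.'`) gives that layout.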

Uwe, yours also works like a charm. I personally don't see a difference
between ismember and ismembc, at least on this machine! :)

Rune, regexp and regexprep both give the "out of memory" error when
used on such long strings on my slow computer (at work).
I guess it will be faster, and will actually run, on my new laptop...
still waiting ........


thanks again guys,
best,
arun.

Subject: manipulating strings

From: Rune Allnor

Date: 8 Jul, 2009 11:01:04

Message: 12 of 16

On 8 Jul, 12:38, arun <aragorn1...@gmail.com> wrote:

> Rune, regexp and regexprep both give the "out of memory" error when
> used on such long strings on my slowwww computer... (at my work).
> I guess, it will be faster and able to be run on my new laptop...
> still waiting ........

Why do you need to process the whole string at once?
Just do like everybody else and split it up in many
manageable parts.

Rune

Subject: manipulating strings

From: arun

Date: 8 Jul, 2009 14:16:08

Message: 13 of 16

On Jul 8, 1:01 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Jul, 12:38, arun <aragorn1...@gmail.com> wrote:
>
> > Rune, regexp and regexprep both give the "out of memory" error when
> > used on such long strings on my slowwww computer... (at my work).
> > I guess, it will be faster and able to be run on my new laptop...
> > still waiting ........
>
> Why do you need to process the whole string at once?
> Just do like everybody else and split it up in many
> manageable parts.
>
> Rune

Yes, that is an alternative I have been using for quite a while because of
my system limitations. I wanted to know about other indexing methods
that could accomplish this task even in my case, like the logical
indexing with ismember shown by some of the members.

thank you very much,
best, arun.

Subject: manipulating strings

From: Loren Shure

Date: 8 Jul, 2009 17:50:34

Message: 14 of 16

In article <d9639397-6d06-4e75-89fc-b6e988b1f16d@h2g2000yqg.googlegroups.com>, aragorn168b@gmail.com says...
> [...] Now my next
> formidable task is to reshape this vector to a 192*240605 matrix.

Help reshape.

--
Loren
http://blogs.mathworks.com/loren

Subject: manipulating strings

From: Yair Altman

Date: 8 Jul, 2009 19:53:01

Message: 15 of 16

A few remarks:

> > > where did you find ismembc? is there a place to find undocumented functions?

One place to look is my blog: http://UndocumentedMatlab.com
Another place is this forum.
Yet another is Matlab's own files, as noted by Us.

> > it is not an undocumented function...
> > rather, look at ISMEMBER...[snip]

Actually, ismembc is an example of an internal helper function that is neither fully documented nor supported by The MathWorks.

> > it's often worthwhile to look at ML stock functions to
> > - see how TMW does things (not always optimized...)
> > - look for hidden gems...
> >
> > us

This is true - most of what I've ever found about undocumented stuff in Matlab comes from Matlab's own source code, which is part of the official installation. Note that this does *NOT* imply official support by MathWorks. The rule of thumb is that only something that appears in the online documentation (or the doc command) is officially supported.

> Uwe, yours also works like a charm. I personally dont see a difference
> between ismember and ismembc, at least on this machine! :)

The difference is quite evident within large loops and/or large arrays. See here: http://undocumentedmatlab.com/blog/ismembc-undocumented-helper-function/

Yair Altman
http://UndocumentedMatlab.com
 

Subject: manipulating strings

From: arun

Date: 9 Jul, 2009 10:11:21

Message: 16 of 16

On Jul 8, 7:50 pm, Loren Shure <lo...@mathworks.com> wrote:
> Help reshape.

When I said I had to "reshape" the arrays, I meant literally using
the "reshape" function. :)
thank you,
Arun.
