Path: news.mathworks.com!not-for-mail
From: "Jason Breslau" <tendiamonds@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Matching Character Phrases...
Date: Sat, 23 Feb 2008 01:46:27 +0000 (UTC)
Organization: The MathWorks, Inc.
Lines: 132
Message-ID: <fpnttj$pk1$1@fred.mathworks.com>
References: <fpkn42$3cg$1@fred.mathworks.com>
Reply-To: "Jason Breslau" <tendiamonds@mathworks.com>



This seems like a good time to use dynamic regular expressions.

Try this: 

>> text = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
>> len = num2str(4);
>> s = '';
>> regexp(text, ...
['(.{',len,'})((?>.*?\1))(?@s=sprintf(''%s%s[%d]\\n'',s,$1,length($2));)(?!)']);
>> s
s =
OPAS[14]
ASKS[38]
GLBO[15]
LOPA[20]
OPAS[20]
GLBO[25]

What did that do? 

 (.{',len,'}) - match exactly len characters, and capture it
in token #1

 ((?>.*?\1)) - match as few characters as possible, followed
by an exact match of token #1.  Make this group atomic (so
that, once the expression is forced to fail, it won't
backtrack to look for a later match of token #1), and
capture this second portion in token #2

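 (?@s=sprintf(''%s%s[%d]\\n'',s,$1,length($2));) - evaluate
this code in the MATLAB workspace.  This appends token #1,
followed by the length of token #2 in brackets, to the
workspace variable s, which was initialized to ''.  Because
token #2 ends with the repeated phrase, its length equals
the distance from the start of one occurrence to the start
of the next.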

 (?!) - an empty negative lookahead, which always fails.
The purpose of the expression is to do the evaluation and
then keep looking from the next starting position, so
regexp never actually returns any matches.
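
For comparison, here is a rough loop-based sketch of what the
expression above computes.  It is not from the original post; the
names n, out, phrase and idx are only illustrative.  It assumes, as
the expression does, that the second occurrence may not overlap the
first, and that only the nearest later occurrence is reported for
each starting position.

```matlab
% Loop-based sketch of the dynamic regular expression's bookkeeping.
% n, out, phrase and idx are illustrative names (assumptions).
text = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
n = 4;                       % phrase length chosen by the user
out = cell(0,2);             % one row per result: {phrase, distance}
for k = 1:numel(text)-2*n+1
    phrase = text(k:k+n-1);
    % Search only the text after the phrase, so matches cannot overlap it
    idx = strfind(text(k+n:end), phrase);
    if ~isempty(idx)
        % idx(1) is relative to position k+n, so the start-to-start
        % distance is idx(1)+n-1, the same value as length($2)
        out(end+1,:) = {phrase, idx(1)+n-1};
    end
end
```

For the sample text this should give the same six phrase/distance
pairs as the regexp call, in the same order.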

Alternatively, you can collect the matches in a cell array
in the workspace, which may be easier to work with:

>> c={};
>> regexp(text, ...
['(.{',len,'})((?>.*?\1))(?@c{end+1}=sprintf(''%s[%d]\\n'',$1,length($2));)(?!)']);
>> c{:}
ans =
OPAS[14]

ans =
ASKS[38]

ans =
GLBO[15]

ans =
LOPA[20]

ans =
OPAS[20]

ans =
GLBO[25]

Or:

>> c={};
>> regexp(text, ...
['(.{',len,'})((?>.*?\1))(?@c{end+1}={$1,length($2)};)(?!)']);
>> c{:}
ans = 
    'OPAS'    [14]
ans = 
    'ASKS'    [38]
ans = 
    'GLBO'    [15]
ans = 
    'LOPA'    [20]
ans = 
    'OPAS'    [20]
ans = 
    'GLBO'    [25]
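
The original question only needs the first 30 results.  One possible
way to get there from the cell array of pairs is sketched below.  The
names maxResults, pairs, phrases and distances are made up for
illustration, and c is rebuilt by hand so the snippet runs on its own.
Note that this trims the list after the search has finished; it does
not stop the search early.

```matlab
% c as produced by the last regexp call: a 1-by-N cell array whose
% elements are {phrase, distance} pairs (rebuilt by hand here).
c = {{'OPAS',14}, {'ASKS',38}, {'GLBO',15}, ...
     {'LOPA',20}, {'OPAS',20}, {'GLBO',25}};
maxResults = 30;                       % illustrative cap from the question
c = c(1:min(maxResults, numel(c)));    % keep at most the first 30 pairs
pairs = vertcat(c{:});                 % N-by-2 cell: phrase, distance
phrases = pairs(:,1);                  % N-by-1 cell array of strings
distances = cell2mat(pairs(:,2));      % N-by-1 numeric vector
```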

Hope that helps,

-=>J


"Jack Branning" <jbr.nospam@nospam.com> wrote in message
<fpkn42$3cg$1@fred.mathworks.com>...
> Hi
> 
> Can anyone help me with figuring out what kind of loop would solve
> this problem?
> 
> I have a variable 'text' that is a series of uppercase characters.
> It looks something like this:
> 
> OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO...
> 
> The user enters a value, and based on this number, the program
> should look for all matching phrases of that length.  For example,
> if they choose '4' the loop should look through 'text' for all
> phrases of this size that occur more than once.  It should also
> record the distance between the matching phrases in another row of
> the array (or a separate array if this is easier). The output array
> for the above 'text' should end up looking something like this:
> 
> OPAS  [14]
> OPAS  [20]
> GLBO  [15]
> GLBO  [23]
> ...etc...
> 
> Using strmatch doesn't seem to help me for this...
> 
> I have a loop that works, but it is very time consuming to run.  I
> only really need to use the first '30' results from the output
> array so it would be ideal if the loop could break when the output
> array is of length 30 (if it gets up to 30, sometimes there will be
> less), otherwise it should end when there are no more matches
> found.