Path: news.mathworks.com!newsfeed-00.mathworks.com!panix!bloom-beacon.mit.edu!senator-bedfellow.mit.edu!dreaderd!not-for-mail
From: Arthur G <gorramfreak+news@gmail.com>
Newsgroups: comp.soft-sys.matlab
Date: Fri, 22 Feb 2008 10:08:38 -0500
Message-ID: <47bee576$0$311$b45e6eb0@senator-bedfellow.mit.edu>
References: <fpkn42$3cg$1@fred.mathworks.com> <47bee1bc$0$294$b45e6eb0@senator-bedfellow.mit.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: Matching Character Phrases...
User-Agent: Unison/1.8
Lines: 36
NNTP-Posting-Host: WHITAKER-FOUR-SEVENTY-EIGHT.MIT.EDU
X-Trace: 1203692919 senator-bedfellow.mit.edu 311 18.56.6.223
Xref: news.mathworks.com comp.soft-sys.matlab:453148



On 2008-02-22 09:52:44 -0500, Arthur G <gorramfreak+news@gmail.com> said:
> 
> All of the current suggestions search through the string multiple 
> times. I think you should run through it once and collect information 
> along the way. If your text is always letters (no spaces or numbers), 
> you can use the words as dynamic field names to quickly "hash" the 
> various words. For example, the following code will create (1) a 
> structure of locations of each "word" and (2) a structure of distances 
> between multiple occurrences of the words. However, this code could 
> become slow if you have *lots* of occurrences of a particular word, 
> because it keeps "growing" arrays [in the line that uses (end+1)]. 
> Really, this problem would be much easier in a language that had more 
> flexible hashes/dictionaries and supported linked lists.
> 
> A = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
> num = 4;
> locationStruct = struct;
> for k=1:(numel(A)-num+1)
>     word = A(k:(k+num-1));
>     if isfield(locationStruct, word)
>         locationStruct.(word)(end+1) = k;
>     else
>         locationStruct.(word) = k;
>     end
> end
> distanceStruct = structfun(@diff, locationStruct, 'UniformOutput', 0);

I just wanted to point out a few more limitations to this approach:
(1) Words are limited to 63 characters, because that's the value 
returned by namelengthmax in MATLAB.
(2) It could also become memory intensive if the text is very long with 
very few word repeats, due to lots of fields in the structure. I don't 
know if there's a limit to the number of fields in a MATLAB structure.

But in the examples you've provided, this certainly isn't a problem...
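
If the field-name limits ever do become a problem, one possible workaround is to key on the words with a containers.Map instead of a struct. This is only a rough sketch and it assumes a MATLAB release that ships containers.Map (R2008b or later, as far as I know); the names locationMap and distanceMap are just made up for illustration. Map keys are ordinary strings, so they are not bound by namelengthmax and may contain spaces or digits. It still grows the arrays, so the speed caveat about words with lots of occurrences applies here too.

```matlab
% Sketch only: assumes containers.Map is available (R2008b or later).
A = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
num = 4;

% Map from each word (char key) to a row vector of its start positions.
locationMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
for k = 1:(numel(A) - num + 1)   % +1 so the final window is included
    word = A(k:(k + num - 1));
    if isKey(locationMap, word)
        % Append this position to the list already stored for the word.
        locationMap(word) = [locationMap(word), k];
    else
        locationMap(word) = k;
    end
end

% Distances between successive occurrences of each word
% (empty for words that occur only once).
words = keys(locationMap);
distanceMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
for i = 1:numel(words)
    distanceMap(words{i}) = diff(locationMap(words{i}));
end
```

With the example string, locationMap('GLBO') should come back as [7 22 47] and distanceMap('GLBO') as [15 25].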