From: Arthur G <gorramfreak+news@gmail.com>
Newsgroups: comp.soft-sys.matlab
Date: Fri, 22 Feb 2008 10:36:20 -0500
Message-ID: <47beebf4$0$287$b45e6eb0@senator-bedfellow.mit.edu>
References: <fpkn42$3cg$1@fred.mathworks.com> <47bee1bc$0$294$b45e6eb0@senator-bedfellow.mit.edu> <fpmp0i$fi6$1@fred.mathworks.com>
Subject: Re: Matching Character Phrases...



On 2008-02-22 10:16:34 -0500, "Jack Branning" <jbr.nospam@nospam.com> said:
> 
>> 
>> All of the current suggestions search through the string multiple
>> times. I think you should run through it once and collect information
>> along the way. If your text is always letters (no spaces or numbers),
>> you can use the words as dynamic field names to quickly "hash" the
>> various words. For example, the following code will create (1) a
>> structure of locations of each "word" and (2) a structure of distances
>> between multiple occurrences of the words. However, this code could
>> become slow if you have *lots* of occurrences of a particular word,
>> because it keeps "growing" arrays [in the line that uses (end+1)].
>> Really, this problem would be much easier in a language that had more
>> flexible hashes/dictionaries and supported linked lists.
>> 
>> A = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
>> num = 4;
>> locationStruct = struct;
>> for k=1:(numel(A)-num+1)   % +1 so the last word in A is included
>>     word = A(k:(k+num-1));
>>     if isfield(locationStruct, word)
>>         locationStruct.(word)(end+1) = k;
>>     else
>>         locationStruct.(word) = k;
>>     end
>> end
>> distanceStruct = structfun(@diff, locationStruct, 'UniformOutput', 0);

> Hi, thank you so much for your help. I think this solution is very
> close to what I need.
> 
> However, I do have a couple of questions:
> How can I build an array of just the 'words' that are repeated in A?
> and how can I build another array that shows the distances between
> matching pairs?
> 
> This solution is 1000% quicker than my solution, so I am very interested in
> hearing how I can put it into practise!
> 
> Thanks again!

Once you have locationStruct and distanceStruct, there are lots of ways to
create the arrays. What's most efficient depends on the number of "single"
words, but here's what I think is a relatively robust solution:

% Total number of distances across all words, used for preallocation
numWords = sum( structfun(@numel, distanceStruct) );
wordList = cell(numWords, 1);
distanceList = zeros(numWords, 1);
count = 0;
fn = fieldnames(distanceStruct);
for i=1:numel(fn)
    word = fn{i};
    for distance=distanceStruct.(word)
        count = count + 1;
        wordList{count} = word;
        distanceList(count) = distance;
    end
end
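
Note that wordList above has one entry per distance, so a word that occurs
three or more times shows up more than once. If what you want is each
repeated word listed just once, here is a rough sketch of one way to do it.
It is self-contained, so it rebuilds locationStruct first. The names
isRepeated, repeatedWords and repeatedDistances are mine, and the loop bound
numel(A)-num+1 is my assumption that the final word of A should be counted
too.

```matlab
% Build the location structure: one field per word, holding start indices.
A = 'OPASKSGLBOJASLOPASNKMGLBOSDLASJSFLOPASHHASKSMLGLBO';
num = 4;
locationStruct = struct;
for k = 1:(numel(A) - num + 1)   % assumption: include the final word of A
    word = A(k:(k + num - 1));
    if isfield(locationStruct, word)
        locationStruct.(word)(end+1) = k;
    else
        locationStruct.(word) = k;
    end
end

% Keep only the words that occur more than once.
fn = fieldnames(locationStruct);
isRepeated = structfun(@(loc) numel(loc) > 1, locationStruct);
repeatedWords = fn(isRepeated);

% For each repeated word, the gaps between consecutive occurrences.
% repeatedDistances{i} belongs to repeatedWords{i}.
repeatedDistances = cellfun(@(w) diff(locationStruct.(w)), ...
                            repeatedWords, 'UniformOutput', false);
```

This keeps the distances grouped per word in a cell array instead of
flattening them into one vector, which may or may not be the layout you want.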