From: <HIDDEN>
Newsgroups: comp.soft-sys.matlab
Subject: Re: buffer of sampling vectors
Date: Fri, 22 Apr 2011 10:01:04 +0000 (UTC)
Organization: Nuclear Research Institute
Lines: 39
Message-ID: <iorjl0$m13$>
References: <> <ionvb4$9dv$>
Reply-To: <HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Trace: 1303466464 22563 (22 Apr 2011 10:01:04 GMT)
NNTP-Posting-Date: Fri, 22 Apr 2011 10:01:04 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 1322323
Xref: comp.soft-sys.matlab:723253

"Roger Stafford" wrote in message <ionvb4$9dv$>...
>   I am not sure what you mean by "trivial techniques", Michal.  Unless you envision the use of mex files, the ordinary functions of matlab are all that you have available for use.  The idea is to use these in the most efficient manner one can think of.
>   For a determination of equality status, lexicographically speaking, between two vectors v1 and v2 I would recommend doing this
>  f = find(v1~=v2,1,'first');
> If f comes up empty, the vectors are equal.  Otherwise, the 'find' scanning has not had to go past the first point of inequality.  In that case a subsequent comparison of the single elements v1(f) and v2(f) will reveal which way the inequality goes.
>   I would then recommend a binary search technique using this inequality method to determine in the least number of steps where, in an ordered "buffer" of vectors, a new vector is either already located or where it could be inserted so as to preserve its ordering.  There is nothing very sophisticated about such a technique, but it has the virtue of being of order O(n*log(m)) where n is the number of elements in the vectors and m is the number of vectors present in the buffer.
>   My final recommendation is that, in view of the large size of your vectors, you actually add new vectors to your buffer sequentially as they appear without regard to order, but instead maintain a set of sequential pointers which contain the index of the vector in the buffer that would be first in an ordered arrangement, then the second and so forth.  This gives your buffer a "virtual" ordering and avoids having to carry out massive shifts of data every time an insertion is to be made, and yet it still enables binary searching.  Only the pointer list would undergo a large shift when a virtual insertion is made.  If the time ever comes when you need the buffer to actually be ordered, this could be carried out comparatively efficiently using this pointer list, provided it doesn't occur very often.
>   If possible you should provide an initial memory allocation to your buffer with the 'zeros' function sufficiently large to accommodate all anticipated additions, and keep track of the number of vectors actually present with some counter.  Otherwise you run into serious delays memory-wise from overly frequent expansions of the buffer.
>   Of course if you merely want to order a fixed set of vectors lexicographically all at one time, the best way would be using the 'unique' function with the 'rows' option.  That guarantees that the vectors will be ordered and without repetition.  However that does not appear to be the problem you are faced with, and you don't want to find yourself repeatedly sorting the same set of vectors over and over again.  Hence the need for binary searching as the vectors are presented one at a time.
> Roger Stafford

Roger, thank you for your interesting hints and recomendations. First of all, I tried to find some simple and compact Matlab solution of my problem. The function "ismember" is probably very good candidate for this kind of problem.

Lets start with testing buffer respresented by matrix M:
m = 1000000;
n = 20;
M = randi([1 n],m,n,'uint8');

and then the function which is able to find the location of unique row vector x within matrix M may looks like:
testloc = 123456;
x = M(testloc,:);
[tf loc] = ismember(x,M,'rows');  

Is possible to find locations of multiple vectors, too:
[tf loc] = ismember([x;y;z],M,'rows');  

This solution is very elegant, but I need something faster. For large m > 1e6 and m>20 is typical response time about 1sec. In a case of multiple vectors is the response time about 10times longer.

Is there any chance to speedup this searching method based on "ismember" function?