From: "Skirt Zhang" <>
Newsgroups: comp.soft-sys.matlab
Subject: Re: shuffle without enough memory
Date: Mon, 12 Nov 2012 09:53:16 +0000 (UTC)
Organization: Universit&#233; de Lausanne
Lines: 93
Message-ID: <k7qguc$iic$>
References: <jcvfv8$6pt$> <jd5cmk$cmu$> <k7jh0a$bov$> <k7ph5v$7jj$>
Reply-To: "Skirt Zhang" <>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Trace: 1352713996 19020 (12 Nov 2012 09:53:16 GMT)
NNTP-Posting-Date: Mon, 12 Nov 2012 09:53:16 +0000 (UTC)
X-Newsreader: MATLAB Central Newsreader 2029090
Xref: comp.soft-sys.matlab:782590

"Roger Stafford" wrote in message <k7ph5v$7jj$>...
> "Skirt Zhang" wrote in message <k7jh0a$bov$>...
> > My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.
> > 
> > To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
> > 
> > I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it : 
> - - - - - - - -
>   Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these.  In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'.  (For the much simpler mean value see the last paragraph below.)
>   There are two stages in the computation of the median.  First, an approximate distribution is found based on a Monte Carlo process in the array P1.  The larger P1 is, the more accurate this approximation.  Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2.  Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included.  From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2.  If it does, then the range of P2(1:C2) is sorted and that median evaluated.  If the median is not within P2, the subrange is 
> appropriately moved up or down and the whole process repeated.  It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step. 
>   The parameter 'f' must be set less than 1.  It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2.  You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious.  The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.
>   The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now.  I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available.  You may have to adjust these in accordance with your particular needs, but remember, larger is better!
> n = length(R); % = 1000
> N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays
> N2 = 2e6; P2 = zeros(N2,1);
> M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products
> f = 3/4; % Overflow protection for P2
> f2 = 1.9; % Small overlap provided between successive subranges
> % Get approximate distribution
> for i1 = 1:N1
>   t = ceil(n*rand(3,1));
>   while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process
>     t = ceil(n*rand(3,1)); % Try again
>   end
>   P1(i1) = prod(R(t));
> end
> P1 = sort(P1);
> % Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2
> % Now get exact median
> m = (N1+1)/2;
> w = f*N2*N1/M/2; % Half the width of the subrange
> b = true;
> while b
>   D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange
>   U = P1(min(floor(m+w),N1)); %  from the P1 array
>   C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero
>   for i1 = 1:n-2 % Generate all M possible ordered triple products
>     for i2 = i1+1:n-1
>       p12 = R(i1)*R(i2);
>       for i3 = i2+1:n
>         p = p12*R(i3);
>         if p <= U
>           if p >= D
>             C2 = C2 + 1; % Product lies within subrange
>             if C2 > N2
>               error('Exceeded P2 array limit')
>             end
>             P2(C2) = p; % Append it to P2
>           else
>             C1 = C1 + 1; % Product is below subrange
>           end
>         else
>           C3 = C3  + 1; % Product is above subrange
>         end
>       end
>     end
>   end
>   if C1 > floor((M-1)/2)
>     m = m - f2*w;
>     fprintf('Adjust P2 subrange downwards\n')
>   elseif C3 > floor((M-1)/2)
>     m = m + f2*w;
>     fprintf('Adjust P2 subrange upwards\n')
>   else
>     b = false; % Median was within P2, exit while loop
>   end
> end
> P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation
> med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median
> fprintf('Median = %23.16e\n',med)
>   The computation for the mean is far, far simpler than the above and requires no large storage.  The mean value of triple products on R is equal to:
>  mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));
>   (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)
> Roger Stafford

Dear Roger,

Thanks so much for this detailed explanation, this really helps a lot!!!

For my project, I may have to obtain the permutation as well. Can you help me about this for calculating mean and median for those triple product?Thanks a lot in advance.

Best regardsĀ