Discover MakerZone

MATLAB and Simulink resources for Arduino, LEGO, and Raspberry Pi

Learn more

Discover what MATLAB® can do for your career.

Opportunities for recent engineering grads.

Apply Today

Thread Subject:
shuffle without enough memory

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 22 Dec, 2011 14:48:08

Message: 1 of 9

Dear All,



I have used this shuffle function for my problem:

I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

Thanks to Jan Simon,
http://www.mathworks.com/matlabcentral/fileexchange/27076-shuffle
Using shuffle, I manage to do this. However, the memory in my PC can only allow me to have 10^7 combinations, and the result is not good given the limit of nr of the memory.

Can anyone suggest me a good way to find the median value from the 10^7 prods?

Instead of telling me to change the computer for larger space?

Thanks a lot in advance and wish u all a nice holiday,

Subject: shuffle without enough memory

From: Roger Stafford

Date: 24 Dec, 2011 20:29:08

Message: 2 of 9

"Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...
> I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.
> .......
> Can anyone suggest me a good way to find the median value from the 10^7 prods?
- - - - - - - - -
  It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.

  Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:

 L = 65e6;
 S = 1e7;
 C = zeros(S,1);
 N = 1000;
 for i1 = 1:N-2
  for i2 = i1+1:N-1
   p12 = i1*i2;
   for i3 = i2+1:N
    p123 = min(max(p12*i3-L+1,1),S);
    C(p123) = C(p123)+1;
   end
  end
 end
 C = cumsum(C);

  Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)

  As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:

 F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)

where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):

 F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =
  t*(1-log(t)+1/2*log(t)^2)

The value of t for which this expression equals 1/2 (for the median) is:

 t = .6897160962790907e-1

so that c at the median would be c = t*N^3 = 68971609.62790907 .

  That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.

Roger Stafford

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 25 Dec, 2011 10:57:09

Message: 3 of 9

Thanks a lot Roger!??????????????? ?????

Wish you a very nice holiday and merry christmas :)


"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...
> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...
> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.
> > .......
> > Can anyone suggest me a good way to find the median value from the 10^7 prods?
> - - - - - - - - -
> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.
>
> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:
>
> L = 65e6;
> S = 1e7;
> C = zeros(S,1);
> N = 1000;
> for i1 = 1:N-2
> for i2 = i1+1:N-1
> p12 = i1*i2;
> for i3 = i2+1:N
> p123 = min(max(p12*i3-L+1,1),S);
> C(p123) = C(p123)+1;
> end
> end
> end
> C = cumsum(C);
>
> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)
>
> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:
>
> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)
>
> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):
>
> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =
> t*(1-log(t)+1/2*log(t)^2)
>
> The value of t for which this expression equals 1/2 (for the median) is:
>
> t = .6897160962790907e-1
>
> so that c at the median would be c = t*N^3 = 68971609.62790907 .
>
> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.
>
> Roger Stafford

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 26 Dec, 2011 13:20:09

Message: 4 of 9

"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...
> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...
> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.
> > .......
> > Can anyone suggest me a good way to find the median value from the 10^7 prods?
> - - - - - - - - -
> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.
>
> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:
>
> L = 65e6;
> S = 1e7;
> C = zeros(S,1);
> N = 1000;
> for i1 = 1:N-2
> for i2 = i1+1:N-1
> p12 = i1*i2;
> for i3 = i2+1:N
> p123 = min(max(p12*i3-L+1,1),S);
> C(p123) = C(p123)+1;
> end
> end
> end
> C = cumsum(C);
>
> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)
>
> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:
>
> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)
>
> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):
>
> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =
> t*(1-log(t)+1/2*log(t)^2)
>
> The value of t for which this expression equals 1/2 (for the median) is:
>
> t = .6897160962790907e-1
>
> so that c at the median would be c = t*N^3 = 68971609.62790907 .
>
> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.
>
> Roger Stafford


Hi Roger,

I am trying to implement your idea. However, for the 1 to 1000 integers I just use them for indexing/representing the firm's identity, i.e. 2 means firm 2's return. So I need to find the median for all the Prods' of 1000 returns, e.g. R1R2R3.....

Does this procedure mean you are calculating the Prod's of any 3 integers from 1:1000?
 p123 = min(max(p12*i3-L+1,1),S);
   C(p123) = C(p123)+1;

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 9 Nov, 2012 18:11:22

Message: 5 of 9

Hi Roger,

I have difficulty to implement your suggestions.
My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.

To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.

I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :

First, I generate 1e7 numbers but I found

floor(C(end)/2), (which is of course 83083500,)

this floor 83083500 is bigger than my 1e7 simulated numbers. So How can I proceed using your approach?

If I need to find the triple product value greater or equal to the product value at 83083500, I need to calculate the product values for 1000*999*998/6=166167000. This seems not saving much effort...
Any suggestions are highly appreciated,

Best regards,
"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...
> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...
> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.
> > .......
> > Can anyone suggest me a good way to find the median value from the 10^7 prods?
> - - - - - - - - -
> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.
>
> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:
>
> L = 65e6;
> S = 1e7;
> C = zeros(S,1);
> N = 1000;
> for i1 = 1:N-2
> for i2 = i1+1:N-1
> p12 = i1*i2;
> for i3 = i2+1:N
> p123 = min(max(p12*i3-L+1,1),S);
> C(p123) = C(p123)+1;
> end
> end
> end
> C = cumsum(C);
>
> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)
>
> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:
>
> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)
>
> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):
>
> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =
> t*(1-log(t)+1/2*log(t)^2)
>
> The value of t for which this expression equals 1/2 (for the median) is:
>
> t = .6897160962790907e-1
>
> so that c at the median would be c = t*N^3 = 68971609.62790907 .
>
> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.
>
> Roger Stafford

Subject: shuffle without enough memory

From: Roger Stafford

Date: 12 Nov, 2012 00:51:11

Message: 6 of 9

"Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...
> My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.
>
> To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
>
> I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :
- - - - - - - -
  Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)

  There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is
appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.

  The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.

  The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!

n = length(R); % = 1000
N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays
N2 = 2e6; P2 = zeros(N2,1);
M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products
f = 3/4; % Overflow protection for P2
f2 = 1.9; % Small overlap provided between successive subranges

% Get approximate distribution
for i1 = 1:N1
  t = ceil(n*rand(3,1));
  while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process
    t = ceil(n*rand(3,1)); % Try again
  end
  P1(i1) = prod(R(t));
end
P1 = sort(P1);
% Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2

% Now get exact median
m = (N1+1)/2;
w = f*N2*N1/M/2; % Half the width of the subrange
b = true;
while b
  D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange
  U = P1(min(floor(m+w),N1)); % from the P1 array
  C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero
  for i1 = 1:n-2 % Generate all M possible ordered triple products
    for i2 = i1+1:n-1
      p12 = R(i1)*R(i2);
      for i3 = i2+1:n
        p = p12*R(i3);
        if p <= U
          if p >= D
            C2 = C2 + 1; % Product lies within subrange
            if C2 > N2
              error('Exceeded P2 array limit')
            end
            P2(C2) = p; % Append it to P2
          else
            C1 = C1 + 1; % Product is below subrange
          end
        else
          C3 = C3 + 1; % Product is above subrange
        end
      end
    end
  end
  if C1 > floor((M-1)/2)
    m = m - f2*w;
    fprintf('Adjust P2 subrange downwards\n')
  elseif C3 > floor((M-1)/2)
    m = m + f2*w;
    fprintf('Adjust P2 subrange upwards\n')
  else
    b = false; % Median was within P2, exit while loop
  end
end
P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation
med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median
fprintf('Median = %23.16e\n',med)

  The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:

 mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

  (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)

Roger Stafford

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 12 Nov, 2012 09:53:16

Message: 7 of 9

"Roger Stafford" wrote in message <k7ph5v$7jj$1@newscl01ah.mathworks.com>...
> "Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...
> > My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.
> >
> > To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
> >
> > I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :
> - - - - - - - -
> Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)
>
> There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is
> appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.
>
> The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.
>
> The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!
>
> n = length(R); % = 1000
> N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays
> N2 = 2e6; P2 = zeros(N2,1);
> M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products
> f = 3/4; % Overflow protection for P2
> f2 = 1.9; % Small overlap provided between successive subranges
>
> % Get approximate distribution
> for i1 = 1:N1
> t = ceil(n*rand(3,1));
> while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process
> t = ceil(n*rand(3,1)); % Try again
> end
> P1(i1) = prod(R(t));
> end
> P1 = sort(P1);
> % Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2
>
> % Now get exact median
> m = (N1+1)/2;
> w = f*N2*N1/M/2; % Half the width of the subrange
> b = true;
> while b
> D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange
> U = P1(min(floor(m+w),N1)); % from the P1 array
> C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero
> for i1 = 1:n-2 % Generate all M possible ordered triple products
> for i2 = i1+1:n-1
> p12 = R(i1)*R(i2);
> for i3 = i2+1:n
> p = p12*R(i3);
> if p <= U
> if p >= D
> C2 = C2 + 1; % Product lies within subrange
> if C2 > N2
> error('Exceeded P2 array limit')
> end
> P2(C2) = p; % Append it to P2
> else
> C1 = C1 + 1; % Product is below subrange
> end
> else
> C3 = C3 + 1; % Product is above subrange
> end
> end
> end
> end
> if C1 > floor((M-1)/2)
> m = m - f2*w;
> fprintf('Adjust P2 subrange downwards\n')
> elseif C3 > floor((M-1)/2)
> m = m + f2*w;
> fprintf('Adjust P2 subrange upwards\n')
> else
> b = false; % Median was within P2, exit while loop
> end
> end
> P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation
> med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median
> fprintf('Median = %23.16e\n',med)
>
> The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:
>
> mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));
>
> (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)
>
> Roger Stafford

Dear Roger,

Thanks so much for this detailed explanation, this really helps a lot!!!

For my project, I may have to obtain the permutation as well. Can you help me about this for calculating mean and median for those triple product?Thanks a lot in advance.

Best regards 

Subject: shuffle without enough memory

From: Skirt Zhang

Date: 12 Nov, 2012 13:13:15

Message: 8 of 9

"Roger Stafford" wrote in message <k7ph5v$7jj$1@newscl01ah.mathworks.com>...
> "Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...
> > My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.
> >
> > To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
> >
> > I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :
> - - - - - - - -
> Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)
>
> There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is
> appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.
>
> The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.
>
> The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!
>
> n = length(R); % = 1000
> N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays
> N2 = 2e6; P2 = zeros(N2,1);
> M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products
> f = 3/4; % Overflow protection for P2
> f2 = 1.9; % Small overlap provided between successive subranges
>
> % Get approximate distribution
> for i1 = 1:N1
> t = ceil(n*rand(3,1));
> while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process
> t = ceil(n*rand(3,1)); % Try again
> end
> P1(i1) = prod(R(t));
> end
> P1 = sort(P1);
> % Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2
>
> % Now get exact median
> m = (N1+1)/2;
> w = f*N2*N1/M/2; % Half the width of the subrange
> b = true;
> while b
> D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange
> U = P1(min(floor(m+w),N1)); % from the P1 array
> C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero
> for i1 = 1:n-2 % Generate all M possible ordered triple products
> for i2 = i1+1:n-1
> p12 = R(i1)*R(i2);
> for i3 = i2+1:n
> p = p12*R(i3);
> if p <= U
> if p >= D
> C2 = C2 + 1; % Product lies within subrange
> if C2 > N2
> error('Exceeded P2 array limit')
> end
> P2(C2) = p; % Append it to P2
> else
> C1 = C1 + 1; % Product is below subrange
> end
> else
> C3 = C3 + 1; % Product is above subrange
> end
> end
> end
> end
> if C1 > floor((M-1)/2)
> m = m - f2*w;
> fprintf('Adjust P2 subrange downwards\n')
> elseif C3 > floor((M-1)/2)
> m = m + f2*w;
> fprintf('Adjust P2 subrange upwards\n')
> else
> b = false; % Median was within P2, exit while loop
> end
> end
> P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation
> med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median
> fprintf('Median = %23.16e\n',med)
>
> The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:
>
> mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));
>
> (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)
>
> Roger Stafford
 mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));
Hi Roger I see the formulae for the triple product PiPjPk but how about the mean value for
PiPk? Thanks

Subject: shuffle without enough memory

From: Roger Stafford

Date: 12 Nov, 2012 17:34:14

Message: 9 of 9

"Skirt Zhang" <silence_qunzi@hotmail.com> wrote in message <k7qguc$iic$1@newscl01ah.mathworks.com>...
> For my project, I may have to obtain the permutation as well. Can you help me about this for calculating mean and median for those triple product?

  What do you mean by "permutation" in your "project" with triple products? Please give more details.

> Hi Roger I see the formulae for the triple product PiPjPk but how about the mean value for PiPk?

  For vector R of length n the mean value of its "double" products would be:

 mv = (sum(R)^2-sum(R.^2))/(n*(n-1));

Roger Stafford

Tags for this Thread

What are tags?

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

Contact us