MATLAB Newsgroup

Dear All,

I have used this shuffle function for my problem:

I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

Thanks to Jan Simon,

http://www.mathworks.com/matlabcentral/fileexchange/27076-shuffle

Using shuffle, I manage to do this. However, the memory in my PC can only allow me to have 10^7 combinations, and the result is not good given the limit of nr of the memory.

Can anyone suggest me a good way to find the median value from the 10^7 prods?

Instead of telling me to change the computer for larger space?

Thanks a lot in advance and wish u all a nice holiday,

"Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...

> I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

> .......

> Can anyone suggest me a good way to find the median value from the 10^7 prods?

- - - - - - - - -

It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.

Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:

L = 65e6;

S = 1e7;

C = zeros(S,1);

N = 1000;

for i1 = 1:N-2

for i2 = i1+1:N-1

p12 = i1*i2;

for i3 = i2+1:N

p123 = min(max(p12*i3-L+1,1),S);

C(p123) = C(p123)+1;

end

end

end

C = cumsum(C);

Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)

As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:

F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)

where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):

F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =

t*(1-log(t)+1/2*log(t)^2)

The value of t for which this expression equals 1/2 (for the median) is:

t = .6897160962790907e-1

so that c at the median would be c = t*N^3 = 68971609.62790907 .

That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.

Roger Stafford

Thanks a lot Roger!??????????????? ?????

Wish you a very nice holiday and merry christmas :)

"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...

> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...

> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

> > .......

> > Can anyone suggest me a good way to find the median value from the 10^7 prods?

> - - - - - - - - -

> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.

>

> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:

>

> L = 65e6;

> S = 1e7;

> C = zeros(S,1);

> N = 1000;

> for i1 = 1:N-2

> for i2 = i1+1:N-1

> p12 = i1*i2;

> for i3 = i2+1:N

> p123 = min(max(p12*i3-L+1,1),S);

> C(p123) = C(p123)+1;

> end

> end

> end

> C = cumsum(C);

>

> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)

>

> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:

>

> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)

>

> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):

>

> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =

> t*(1-log(t)+1/2*log(t)^2)

>

> The value of t for which this expression equals 1/2 (for the median) is:

>

> t = .6897160962790907e-1

>

> so that c at the median would be c = t*N^3 = 68971609.62790907 .

>

> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.

>

> Roger Stafford

"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...

> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...

> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

> > .......

> > Can anyone suggest me a good way to find the median value from the 10^7 prods?

> - - - - - - - - -

> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.

>

> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:

>

> L = 65e6;

> S = 1e7;

> C = zeros(S,1);

> N = 1000;

> for i1 = 1:N-2

> for i2 = i1+1:N-1

> p12 = i1*i2;

> for i3 = i2+1:N

> p123 = min(max(p12*i3-L+1,1),S);

> C(p123) = C(p123)+1;

> end

> end

> end

> C = cumsum(C);

>

> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)

>

> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:

>

> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)

>

> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):

>

> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =

> t*(1-log(t)+1/2*log(t)^2)

>

> The value of t for which this expression equals 1/2 (for the median) is:

>

> t = .6897160962790907e-1

>

> so that c at the median would be c = t*N^3 = 68971609.62790907 .

>

> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.

>

> Roger Stafford

Hi Roger,

I am trying to implement your idea. However, for the 1 to 1000 integers I just use them for indexing/representing the firm's identity, i.e. 2 means firm 2's return. So I need to find the median for all the Prods' of 1000 returns, e.g. R1R2R3.....

Does this procedure mean you are calculating the Prod's of any 3 integers from 1:1000?

p123 = min(max(p12*i3-L+1,1),S);

C(p123) = C(p123)+1;

Hi Roger,

I have difficulty to implement your suggestions.

My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.

To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.

I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :

First, I generate 1e7 numbers but I found

floor(C(end)/2), (which is of course 83083500,)

this floor 83083500 is bigger than my 1e7 simulated numbers. So How can I proceed using your approach?

If I need to find the triple product value greater or equal to the product value at 83083500, I need to calculate the product values for 1000*999*998/6=166167000. This seems not saving much effort...

Any suggestions are highly appreciated,

Best regards,

"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>...

> "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>...

> > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods.

> > .......

> > Can anyone suggest me a good way to find the median value from the 10^7 prods?

> - - - - - - - - -

> It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value.

>

> Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines:

>

> L = 65e6;

> S = 1e7;

> C = zeros(S,1);

> N = 1000;

> for i1 = 1:N-2

> for i2 = i1+1:N-1

> p12 = i1*i2;

> for i3 = i2+1:N

> p123 = min(max(p12*i3-L+1,1),S);

> C(p123) = C(p123)+1;

> end

> end

> end

> C = cumsum(C);

>

> Taking into account its "L-1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end-1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.)

>

> As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is:

>

> F(c) = Pr(x*y*z<=c) = 1 - Pr(x*y*z>c) = 1 - Pr(u*v*w>t)

>

> where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox):

>

> F(c) = 1-int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) =

> t*(1-log(t)+1/2*log(t)^2)

>

> The value of t for which this expression equals 1/2 (for the median) is:

>

> t = .6897160962790907e-1

>

> so that c at the median would be c = t*N^3 = 68971609.62790907 .

>

> That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested for-loops fail to entrap the median in C(2:end-1), appropriate adjustments can be made with the value 'L' for other tries.

>

> Roger Stafford

"Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...

> My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.

>

> To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.

>

> I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :

- - - - - - - -

Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)

There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is

appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.

The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.

The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!

n = length(R); % = 1000

N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays

N2 = 2e6; P2 = zeros(N2,1);

M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products

f = 3/4; % Overflow protection for P2

f2 = 1.9; % Small overlap provided between successive subranges

% Get approximate distribution

for i1 = 1:N1

t = ceil(n*rand(3,1));

while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process

t = ceil(n*rand(3,1)); % Try again

end

P1(i1) = prod(R(t));

end

P1 = sort(P1);

% Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2

% Now get exact median

m = (N1+1)/2;

w = f*N2*N1/M/2; % Half the width of the subrange

b = true;

while b

D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange

U = P1(min(floor(m+w),N1)); % from the P1 array

C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero

for i1 = 1:n-2 % Generate all M possible ordered triple products

for i2 = i1+1:n-1

p12 = R(i1)*R(i2);

for i3 = i2+1:n

p = p12*R(i3);

if p <= U

if p >= D

C2 = C2 + 1; % Product lies within subrange

if C2 > N2

error('Exceeded P2 array limit')

end

P2(C2) = p; % Append it to P2

else

C1 = C1 + 1; % Product is below subrange

end

else

C3 = C3 + 1; % Product is above subrange

end

end

end

end

if C1 > floor((M-1)/2)

m = m - f2*w;

fprintf('Adjust P2 subrange downwards\n')

elseif C3 > floor((M-1)/2)

m = m + f2*w;

fprintf('Adjust P2 subrange upwards\n')

else

b = false; % Median was within P2, exit while loop

end

end

P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation

med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median

fprintf('Median = %23.16e\n',med)

The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:

mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

(Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)

Roger Stafford

"Roger Stafford" wrote in message <k7ph5v$7jj$1@newscl01ah.mathworks.com>...

> "Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...

> > My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.

> >

> > To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.

> >

> > I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :

> - - - - - - - -

> Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)

>

> There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is

> appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.

>

> The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.

>

> The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!

>

> n = length(R); % = 1000

> N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays

> N2 = 2e6; P2 = zeros(N2,1);

> M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products

> f = 3/4; % Overflow protection for P2

> f2 = 1.9; % Small overlap provided between successive subranges

>

> % Get approximate distribution

> for i1 = 1:N1

> t = ceil(n*rand(3,1));

> while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process

> t = ceil(n*rand(3,1)); % Try again

> end

> P1(i1) = prod(R(t));

> end

> P1 = sort(P1);

> % Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2

>

> % Now get exact median

> m = (N1+1)/2;

> w = f*N2*N1/M/2; % Half the width of the subrange

> b = true;

> while b

> D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange

> U = P1(min(floor(m+w),N1)); % from the P1 array

> C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero

> for i1 = 1:n-2 % Generate all M possible ordered triple products

> for i2 = i1+1:n-1

> p12 = R(i1)*R(i2);

> for i3 = i2+1:n

> p = p12*R(i3);

> if p <= U

> if p >= D

> C2 = C2 + 1; % Product lies within subrange

> if C2 > N2

> error('Exceeded P2 array limit')

> end

> P2(C2) = p; % Append it to P2

> else

> C1 = C1 + 1; % Product is below subrange

> end

> else

> C3 = C3 + 1; % Product is above subrange

> end

> end

> end

> end

> if C1 > floor((M-1)/2)

> m = m - f2*w;

> fprintf('Adjust P2 subrange downwards\n')

> elseif C3 > floor((M-1)/2)

> m = m + f2*w;

> fprintf('Adjust P2 subrange upwards\n')

> else

> b = false; % Median was within P2, exit while loop

> end

> end

> P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation

> med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median

> fprintf('Median = %23.16e\n',med)

>

> The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:

>

> mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

>

> (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)

>

> Roger Stafford

Dear Roger,

Thanks so much for this detailed explanation, this really helps a lot!!!

For my project, I may have to obtain the permutation as well. Can you help me about this for calculating mean and median for those triple product?Thanks a lot in advance.

Best regards

"Roger Stafford" wrote in message <k7ph5v$7jj$1@newscl01ah.mathworks.com>...

> "Skirt Zhang" wrote in message <k7jh0a$bov$1@newscl01ah.mathworks.com>...

> > My problem is I have 1000 data in columns with real value Pi. The 1-1000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the bi-products of them, i.e. PiPjPk and PiPj, i j k are different from each other.

> >

> > To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.

> >

> > I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :

> - - - - - - - -

> Skirt Zhang, after reviewing your recent posting and your earlier one Dec. 26, 2011, I now understand that you have one thousand "return" values and you wish to find the precise median and mean for all possible triple products of distinct values among these. In the codes below, the n=1000 return values are assumed to already be in a vector I call 'R'. (For the much simpler mean value see the last paragraph below.)

>

> There are two stages in the computation of the median. First, an approximate distribution is found based on a Monte Carlo process in the array P1. The larger P1 is, the more accurate this approximation. Second, the approximate median as determined by P1 is used to choose an initial subrange of values from all possible triple products R(i)*R(j)*R(k), i<j<k, selected so as to fit within array P2. Three counts are kept: 1) C1, the number of products below the subrange and therefore not included in P2, 2) C2, the number of products inserted into P2, and 3) C3, the number above the subrange and also not included. From the final C1, C2, and C3 values, which must sum to M=n*(n-1)*(n-2)/6, it can be determined whether or not the median value lies within P2. If it does, then the range of P2(1:C2) is sorted and that median evaluated. If the median is not within P2, the subrange is

> appropriately moved up or down and the whole process repeated. It should be borne in mind that each time a new subrange has to be selected in this manner it requires that all n*(n-1)*(n-2)/6 = 166,167,000 ordered triple products be generated, so an appreciable amount of computation time will be required for each such step.

>

> The parameter 'f' must be set less than 1. It sets the fraction of the P2 array that would theoretically be used if the distribution in P1 were perfect, so by setting it less than 1 this helps guard against potential overflow of P2. You may want to reduce 'f' if you encounter trouble with P2 overflow or to increase it if it proves to be too cautious. The parameter 'f2' should be slightly less than 2 and provides a small overlap between successive subranges whenever more than one is necessary.

>

> The following code is somewhat crude and could probably be improved upon with more intelligent adjustment of successive subranges or with better handling of P2 overflow, but this is all I have time for now. I have set very large sizes, N1 and N2, for the two arrays, P1 and P2, in view of your statement that you had 10e6 memory available. You may have to adjust these in accordance with your particular needs, but remember, larger is better!

>

> n = length(R); % = 1000

> N1 = 1e6; P1 = zeros(N1,1); % Allocate huge arrays

> N2 = 2e6; P2 = zeros(N2,1);

> M = n*(n-1)*(n-2)/6; % The total number of possible ordered triple products

> f = 3/4; % Overflow protection for P2

> f2 = 1.9; % Small overlap provided between successive subranges

>

> % Get approximate distribution

> for i1 = 1:N1

> t = ceil(n*rand(3,1));

> while t(1)==t(2)||t(2)==t(3)||t(3)==t(1) % <-- The rejection process

> t = ceil(n*rand(3,1)); % Try again

> end

> P1(i1) = prod(R(t));

> end

> P1 = sort(P1);

> % Approx. median would be (P1(floor((N1+1)/2))+P1(ceil((N1+1)/2)))/2

>

> % Now get exact median

> m = (N1+1)/2;

> w = f*N2*N1/M/2; % Half the width of the subrange

> b = true;

> while b

> D = P1(max(ceil(m-w),1)); % Get lower and upper bounds on subrange

> U = P1(min(floor(m+w),N1)); % from the P1 array

> C1 = 0; C2 = 0; C3 = 0; % Start all counts at zero

> for i1 = 1:n-2 % Generate all M possible ordered triple products

> for i2 = i1+1:n-1

> p12 = R(i1)*R(i2);

> for i3 = i2+1:n

> p = p12*R(i3);

> if p <= U

> if p >= D

> C2 = C2 + 1; % Product lies within subrange

> if C2 > N2

> error('Exceeded P2 array limit')

> end

> P2(C2) = p; % Append it to P2

> else

> C1 = C1 + 1; % Product is below subrange

> end

> else

> C3 = C3 + 1; % Product is above subrange

> end

> end

> end

> end

> if C1 > floor((M-1)/2)

> m = m - f2*w;

> fprintf('Adjust P2 subrange downwards\n')

> elseif C3 > floor((M-1)/2)

> m = m + f2*w;

> fprintf('Adjust P2 subrange upwards\n')

> else

> b = false; % Median was within P2, exit while loop

> end

> end

> P2(1:C2) = sort(P2(1:C2)); % Sort in preparation for median computation

> med = (P2(floor((M+1)/2)-C1)+P2(ceil((M+1)/2)-C1))/2; % The precise median

> fprintf('Median = %23.16e\n',med)

>

> The computation for the mean is far, far simpler than the above and requires no large storage. The mean value of triple products on R is equal to:

>

> mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

>

> (Note: If you want to check the validity of these algorithms, use short test R vectors with appropriately reduced P1 and P2 sizes and compare those results with a direct matlab median and mean computation of a fully stored set of triple products thereof.)

>

> Roger Stafford

mv = (sum(R)^3-3*sum(R.^2)*sum(R)+2*sum(R.^3))/(n*(n-1)*(n-2));

Hi Roger I see the formulae for the triple product PiPjPk but how about the mean value for

PiPk? Thanks

"Skirt Zhang" <silence_qunzi@hotmail.com> wrote in message <k7qguc$iic$1@newscl01ah.mathworks.com>...

> For my project, I may have to obtain the permutation as well. Can you help me about this for calculating mean and median for those triple product?

What do you mean by "permutation" in your "project" with triple products? Please give more details.

> Hi Roger I see the formulae for the triple product PiPjPk but how about the mean value for PiPk?

For vector R of length n the mean value of its "double" products would be:

mv = (sum(R)^2-sum(R.^2))/(n*(n-1));

Roger Stafford

You can think of your watch list as threads that you have bookmarked.

You can add tags, authors, threads, and even search results to your watch list. This way you can easily keep track of topics that you're interested in. To view your watch list, click on the "My Newsreader" link.

To add items to your watch list, click the "add to watch list" link at the bottom of any page.

To add search criteria to your watch list, search for the desired term in the search box. Click on the "Add this search to my watch list" link on the search results page.

You can also add a tag to your watch list by searching for the tag with the directive "tag:tag_name" where tag_name is the name of the tag you would like to watch.

To add an author to your watch list, go to the author's profile page and click on the "Add this author to my watch list" link at the top of the page. You can also add an author to your watch list by going to a thread that the author has posted to and clicking on the "Add this author to my watch list" link. You will be notified whenever the author makes a post.

To add a thread to your watch list, go to the thread page and click the "Add this thread to my watch list" link at the top of the page.

A tag is like a keyword or category label associated with each thread. Tags make it easier for you to find threads of interest.

Anyone can tag a thread. Tags are public and visible to everyone.

The newsgroups are a worldwide forum that is open to everyone. Newsgroups are used to discuss a huge range of topics, make announcements, and trade files.

Discussions are threaded, or grouped in a way that allows you to read a posted message and all of its replies in chronological order. This makes it easy to follow the thread of the conversation, and to see what’s already been said before you post your own reply or make a new posting.

Newsgroup content is distributed by servers hosted by various organizations on the Internet. Messages are exchanged and managed using open-standard protocols. No single entity “owns” the newsgroups.

There are thousands of newsgroups, each addressing a single topic or area of interest. The MATLAB Central Newsreader posts and displays messages in the comp.soft-sys.matlab newsgroup.

**MATLAB Central**

You can use the integrated newsreader at the MATLAB Central website to read and post messages in this newsgroup. MATLAB Central is hosted by MathWorks.

Messages posted through the MATLAB Central Newsreader are seen by everyone using the newsgroups, regardless of how they access the newsgroups. There are several advantages to using MATLAB Central.

**One Account**

Your MATLAB Central account is tied to your MathWorks Account for easy access.

**Use the Email Address of Your Choice**

The MATLAB Central Newsreader allows you to define an alternative email address as your posting address, avoiding clutter in your primary mailbox and reducing spam.

**Spam Control**

Most newsgroup spam is filtered out by the MATLAB Central Newsreader.

**Tagging**

Messages can be tagged with a relevant label by any signed-in user. Tags can be used as keywords to find particular files of interest, or as a way to categorize your bookmarked postings. You may choose to allow others to view your tags, and you can view or search others’ tags as well as those of the community at large. Tagging provides a way to see both the big trends and the smaller, more obscure ideas and applications.

**Watch lists**

Setting up watch lists allows you to be notified of updates made to postings selected by author, thread, or any search variable. Your watch list notifications can be sent by email (daily digest or immediate), displayed in My Newsreader, or sent via RSS feed.

- Use a newsreader through your school, employer, or internet service provider
- Pay for newsgroup access from a commercial provider
- Use Google Groups
- Mathforum.org provides a newsreader with access to the comp.soft sys.matlab newsgroup
- Run your own server. For typical instructions, see: http://www.slyck.com/ng.php?page=2