Random elements of a vector which sum near a specific number

2 views (last 30 days)
This is a fun question. I have a large vector which I need to take out 1000 random sub-vectors whose elements all add up to around the same value.
The length of this vector is 33268 with an average value of 0.028743, max of 6.0637, min of 0.006063, and std of 0.040495. The total sum is 956.23 and I need subvectors which add to 20, 30, and so on... So I think this should be possible while being fairly random. This is my attempt:
for j=1:1000
while 1
randoms = (randi(2,[1 length(parentweight)])-1);
sumweights(i) = sum(parentweight(find(randoms)));
if (sumweights>=n-mean(parentweight))&&(sumweights<=n+mean(parentweight))
break;
end
end
randsubinds{j} = randoms;
end
end
I let this run for 86000 iterations and the minimum sum was 462.19 and the max was 495.35 with a 4.5 std. In other words it seems to be always choosing almost exactly half of the elements and will never go down to around 20. That's because it's a uniform distribution I assume but there's no poisson distributed integer function.
While typing this out I came up with this:
for j=1:1000
broken = 0;
while 1
for i = 1:length(parentweight)
randoms(i) = randi(length(parentweight),1);
subweights(i) = parentweight(randoms(i));
if (sum(subweights)>=n-mean(parentweight)/2)&&(sum(subweights)<=n+mean(parentweight)/2)
broken = 1;
break;
end
end
if broken
break;
end
clear randoms subweights
end
randsubinds{j} = randoms;
clear randoms subweights
end
This is very slow. Anyone have ideas on a better way to do this?
  2 Comments
Jim Hokanson
Jim Hokanson on 2 Jul 2013
Do you care about the uniqueness of the solutions both within a target value and between target values? For the between example would a vector of ids for 20 be fine and that same vector plus a numeric value of 10 be fine for 30 (assuming 10 were actually present). The main point being that the two vectors only differ by a small subset of numbers.
Jon
Jon on 2 Jul 2013
They should be found separately and be more unique than that for sure.

Sign in to comment.

Accepted Answer

Jim Hokanson
Jim Hokanson on 2 Jul 2013
Here's a solution.
Some notes: # Remove calculations from your loops, like the mean of the data, this doesn't change # sorting random data is a nice way of shuffling your indices
N_RAND_PER_ITERATION = 1000; %Make this larger to go quicker, but it also
%takes a lot more memory
N_ITERATIONS_MAX = 10; %You could make this really large to force
%obtaining a solution
N_SOLUTIONS_GOAL = 1000; %You might want to go higher if you want to trim
%the solutions afterwards
%This is just to avoid overflow on the solution cell array
N_SOLUTIONS_MAX = N_SOLUTIONS_GOAL + 1000;
%This roughly matches the statistics in the post
n = 33268;
data = gamrnd(0.05,1,1,n);
target_vectors = 20:10:100;
%This code doesn't need to change
%---------------------------------------------------------
n_targets = length(target_vectors);
solution_count = zeros(1,n_targets); %This keeps track of how many
%valid vectors we've found for each target length
final_solutions = cell(N_SOLUTIONS_MAX,n_targets); %Indices go back into
%the original data vector
%
%i.e. in this example
%temp = final_solutions{1,1}
%sum(temp) => will be about 20
tolerance = mean(data);
for iRand = 1:N_ITERATIONS_MAX
[~,I_shuffle] = sort(rand(N_RAND_PER_ITERATION,n),2);
shuffled_data = data(I_shuffle);
cum_data = cumsum(shuffled_data,2);
for iTarget = 1:n_targets
cur_target = target_vectors(iTarget);
%This is the value we are trying to hit
[min_val,I2] = min(abs(cum_data - cur_target),[],2);
%Find the set of values which bring us closest to the target
valid_entries_I = find(min_val < tolerance);
%Does this closest value come within our tolerance?
n_entries = length(valid_entries_I);
%This bit of code goes through each of the solutions
%and assigns it to the final output
cur_solution_index = solution_count(iTarget);
for iMatch = 1:n_entries
cur_solution_index = cur_solution_index + 1;
cur_entry = valid_entries_I(iMatch);
cur_end = I2(cur_entry);
final_solutions{cur_solution_index,iTarget} = I_shuffle(cur_entry,1:cur_end);
end
solution_count(iTarget) = cur_solution_index;
end
if all(solution_count > N_SOLUTIONS_GOAL)
break
end
end

More Answers (1)

Image Analyst
Image Analyst on 2 Jul 2013
I'm not sure I understand why you think that random elements taken from a vector should ever add up to a specific number, unless they're integers and you just got lucky.
  2 Comments
Jon
Jon on 2 Jul 2013
Edited: Jon on 2 Jul 2013
That's why it's "add up to around the same value" and I use this as criteria for an acceptable subvector if (sum(subweights)>=n-mean(parentweight))&&(sum(subweights)<=n+mean(parentweight)). I changed it slighly than what's above. Based upon the statistics of the values of the parent vector I should be able to get pretty random vectors which sum to aournd 20 or more because these numbers are orders of magnitude larger than the average element of the parent vector.
This weight that I'm summing over is the measure of how many single "events" that value counts as. It can be a lot less than one because this is a Monte Carlo simulation. I need to take a random sampling of a specific number of events. My code works. It's just very slow.
Image Analyst
Image Analyst on 2 Jul 2013
OK, you can check if the sum is within some tolerance value, like shown in the FAQ.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!