
Thread Subject:
Parallel Computing Toolbox - Random numbers generation within tasks - a seed issue...

Subject: Parallel Computing Toolbox - Random numbers generation within tasks - a seed issue...

From: Gabriele

Date: 25 Mar, 2013 10:34:07

Message: 1 of 8

Hi All,
I am having some problems in consistently generating random numbers within tasks.
I suppose my problems come from the fact that it is not clear to me how the seed for the stream is handled by the tasks belonging to a job.

So, to make a long story short, and to simplify the problem, I have a job, which comprises some tasks. Each task is generating random numbers. I would like, of course, that:
1) Generated (pseudo-)random numbers are different from task to task (actually also between tasks belonging to different jobs);
2) Generated (pseudo-)random numbers are different if I run the code twice.

Unfortunately, I cannot manage to get both.

If I just write plain code (simply calling, e.g., "rand" from each task), I achieve 1) but not 2): outcomes differ from task to task, but if I run the code twice, I get exactly the same outcomes.

If I try to force the seed (using, e.g., rng('shuffle')), then in some cases different tasks (typically 2 tasks) seem to start at the same time (within the resolution of the "shuffle" algorithm, which seems to be 1/100 s, looking at randstream.m). As a result, some outcomes are different, while others are the same.

I tried putting an rng('shuffle') command in jobStartup.m and in taskStartup.m, but I couldn't achieve a robust result fulfilling 1) & 2) above. It is not clear to me how an rng(something) command in jobStartup.m affects the tasks.

I have also tried passing the seed as a parameter to each task, creating the seed for each task from the progressive task number (say, the task ID...). However, this is not very robust: if you start your code twice, the number of tasks combined with the time difference between the two runs can lead to partially identical results (in one case you use, e.g., seed=time+task_number, in the second case seed=time+delta_time+task_number, and for a given delta_time and two different task_number values you could get the same seed).
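
A minimal sketch of the collision described above. The clock values t1 and t2 and the task count are hypothetical numbers chosen only for illustration:

```matlab
% Hypothetical integer clock readings for two runs of the same script.
t1 = 1000;          % first run
t2 = t1 + 2;        % second run starts delta_time = 2 later
nTasks = 5;         % assumed number of tasks per run

seedsRun1 = t1 + (1:nTasks);   % seed = time + task_number
seedsRun2 = t2 + (1:nTasks);   % seed = time + delta_time + task_number

% Any seed present in both runs produces the same random sequence twice.
shared = intersect(seedsRun1, seedsRun2);
fprintf('%d seeds are shared between the two runs\n', numel(shared));
```

Here tasks 3 to 5 of the first run get the same seeds as tasks 1 to 3 of the second run, which is the partial repetition mentioned above.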

So, this is the problem.

I post below some code which reproduces the issue, at least on my hardware. In my case the local profile runs 4 workers (plus one client) because I have a quad core. Note that the issue does not always happen, so it might be necessary to run the code a few times to see a repetition in the generation.

As you will see, in the task creation there are 4 options. Note that:
- option 4 does not lead to repetition in my case, but results are the same at each run (looks like the starting seed for the tasks is always the same...i.e. 0). So this option is not usable.
- option 3: in my case leads to some repetitions in the generated numbers. So this is not working.
- option 2: can potentially lead to repetitions if the operations within the for-loop are faster than the "shuffle time accuracy". In my case I have not noticed any repetition, so this looks like the preferable option...but I am not 100% sure...a possibility would be to add a pause(0.01) command in the loop (just to be sure), but this is not fantastic...
- option 1: can potentially lead to repetitions between different runs of the code

A global alternative would be to create the seeds beforehand for each task...

ok, the code is below...

%-------------
%Main script
%
 
%% Identify a cluster:
parallel.defaultClusterProfile('local');
c = parcluster();

%% Create a job
j = createJob(c);

%% Create tasks within a job
%test random number generation
Ntests=6*5;
for jtest=1:Ntests, %create Ntests tasks
 
    %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],getfield(rng,'Seed')+jtest}); %option 1: fix the seed from the main script, based on the seed of the client
    %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],getfield(rng(rng('shuffle')),'Seed')}); %option 2: generate the seed using "shuffle" at this moment
    %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],[]}); %option 3: let the function generate the seed internally, using shuffle
    t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],-1}); %option 4: let the task use the seed it is supposed to use

end;

%% Submit the job to the queue
submit(j);

%% Wait for the job to complete:
wait(j)

%% Get results
results = fetchOutputs(j);

%% Delete the job and permanently remove the job from the scheduler's storage location
delete(j)

%% Check the output
%if two columns are equal, it means the corresponding tasks started from the same
%random seed...which is something not wanted!
fprintf('\nIf two columns are equal, this is bad...')
final_data=[results{:}]
if any(diff(sort(final_data(1,:)))==0), %checking the first line is sufficient in this case
    fprintf('\n...there is a generation problem!\n');
else
    fprintf('\n...this generation seems to be ok!\n');
end;

%----------------------------

%---------------------------
%Additional function

function out=f_myrand_with_seed(dim,sd)

if nargin>1 && ~isempty(sd),
    if sd>0, %change the seed to the required value
        rng(sd);
    end; %note that when sd<=0 we do NOTHING
else
    rng('shuffle'); %use the clock-based seed
end;
out=rand(dim);
%out=rng;out=out.Seed; %use this line to have the seed from the present task
%-----------------------------

thanks for your comments...

bye,
gabriele

Subject: Parallel Computing Toolbox - Random numbers generation within tasks - a seed issue...

From: Yair Altman

Date: 26 Mar, 2013 20:19:14

Message: 2 of 8

If you use a seed of now()*taskNumber*labindex() it should be unique enough to answer all your requirements.

Yair Altman
http://UndocumentedMatlab.com
 

Subject: Parallel Computing Toolbox - Random numbers generation within

From: Peter Perkins

Date: 2 Apr, 2013 13:24:27

Message: 3 of 8

Gabriele, you're doing large-scale parallel simulations. You should be
using the right tools for that. Setting seeds based on current time or
whatever is like throwing darts at a dartboard. You need something more
controlled.

MATLAB includes two random number generators, mrg32k3a and mlfg6331_64,
that are specifically designed for the kind of thing you're doing. They
both support multiple independent streams and substreams (the latter are
more or less a lighter-weight version of the former). I can't really
follow all of the "topology" that you describe, but by basing the stream
(or substream) index on the tasks, or workers, or runs, you can ensure
that you don't reuse the same random numbers.

This is described at length in a couple of blog posts:

http://blogs.mathworks.com/loren/2008/11/05/new-ways-with-random-numbers-part-i
http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii

I hope this is helpful.
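
As a rough sketch of what indexing streams by task could look like: the stream count of 4 and the fixed seed 0 below are arbitrary choices for illustration, not values taken from this thread.

```matlab
% Create 4 independent mrg32k3a streams that share one seed and differ
% only in their stream index. In a real job, each task would create just
% its own stream from its own index.
nStreams = 4;                       % assumed number of tasks
streams  = cell(1, nStreams);
for k = 1:nStreams
    streams{k} = RandStream.create('mrg32k3a', ...
        'NumStreams', nStreams, 'StreamIndices', k, 'Seed', 0);
end

% Draw 3 numbers from each stream; each column comes from one stream.
draws = zeros(3, nStreams);
for k = 1:nStreams
    draws(:, k) = rand(streams{k}, 3, 1);
end
disp(draws)

% Recreating stream 2 with the same settings reproduces its numbers.
s2 = RandStream.create('mrg32k3a', ...
    'NumStreams', nStreams, 'StreamIndices', 2, 'Seed', 0);
again = rand(s2, 3, 1);
```

The columns of draws differ from one another, while again matches the second column, so the numbers a task sees depend only on its index and not on when it started.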





Subject: Parallel Computing Toolbox - Random numbers generation within

From: Gabriele

Date: 11 Apr, 2013 14:06:06

Message: 4 of 8

Hi Peter,
sorry for the late reply, but I was doing some testing.
Thanks for the suggestions.

I exchanged some e-mails with MATLAB support, and I received some good suggestions on this point.

These suggestions go along the same lines as yours, i.e. using streams and substreams.

Now I have two main possibilities to choose from.

Option A:
Prepare a file taskStartup.m embedding the following code

%------------------
     taskID = task.ID;
     job = task.Parent;
     jobID = job.ID;

     s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');
     s.Substream = taskID;
     RandStream.setGlobalStream(s);
%------------------

Basically, the stream is selected on the basis of the jobID and the substream is selected on the basis of the taskID. In addition to this, the seed of the stream s is based on the clock time.
My feeling is that this should work, because even if the jobID and the taskID are the same for two different calculations (suppose a previous (job,task) was properly deleted), the "shuffle" option should modify the seed, thus leading to different generations.

However, in general this approach tends to always use the "low" streams/substreams (because usually the jobID and taskID are relatively small numbers compared to the max number of streams / substreams).

An alternative I was thinking of would be as follows:

Option B:
%-----------------------------------

%get task ID, (parent) job ID and job creation time
taskID = task.ID;
job = task.Parent;
jobID = job.ID;
job_creation_time=job.CreateTime;
  
%convert the job creation time to seconds
job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
job_creation_time=round(datenum(job_creation_time,'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
    
%create shift indices using the job creation time:
shift_index_stream=job_creation_time; %shift index for stream
shift_index_substream=round(job_creation_time/1000); %shift index for substream
    
%Now:
%1) Create a large number of independent streams;
%2) Select the stream using shift_index_stream and jobID
%3) Generate also a random seed using "shuffle"
%4) For this particular task use a substream identified by
% shift_index_substream and taskID
%
NS=2^63; %number of multiple independent streams
s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', shift_index_stream+jobID,'seed','shuffle');
s.Substream = shift_index_substream+taskID;
RandStream.setGlobalStream(s);
%-----------------------------------

The idea is to shift the indices of the stream and substream. The shift is based on the jobID and the job creation time for the stream, and on the taskID and the job creation time for the substream.
On top of this, "shuffle" is used to create a seed which is based on the task startup time (since taskStartup.m is called when the task starts).

This looks like "mixing" things a little bit more (since the stream/substream indices which are used are more diverse). However, I'm not sure this gives correct statistical properties.

So, the question is: considering that both seem to do the job, is it better to use option A or option B?

Thanks,
Gabriele

PS:
I have noticed something that looks strange to me in the definition of the class RandStream, where the shuffle algorithm is implemented (function seed = shuffleSeed).

I see the following :
line #733: seed0 = mod(floor(now*8640000),2^31-1);
line #735: seed = mod(floor(now*8640000),2^31-1);

However, considering the seed can be any number smaller than 2^32, I would have expected:
line #733: seed0 = mod(floor(now*8640000),2^32-1);
line #735: seed = mod(floor(now*8640000),2^32-1);

Why is the shuffle seed limited to 2^31-1?
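
The 1/100 s resolution follows from the expression in the lines quoted above, since 8640000 is 86400 seconds per day times 100. A small sketch, using fixed datenum values in place of now (the timestamps are arbitrary examples, and shuffleLike is a hypothetical name for the quoted expression):

```matlab
% Same expression as in the quoted shuffleSeed lines, with the time
% passed in explicitly instead of read from the clock.
shuffleLike = @(t) mod(floor(t * 8640000), 2^31 - 1);

tA = datenum(2013, 3, 25, 10, 34, 7.005);  % arbitrary example time
tB = datenum(2013, 3, 25, 10, 34, 7.007);  % 2 ms later, same 1/100 s
tC = datenum(2013, 3, 25, 10, 34, 7.015);  % 10 ms later, next 1/100 s

seedA = shuffleLike(tA);
seedB = shuffleLike(tB);
seedC = shuffleLike(tC);

fprintf('A and B collide: %d\n', seedA == seedB);
fprintf('A and C collide: %d\n', seedA == seedC);
```

Two tasks that start within the same hundredth of a second would therefore get the same clock-based seed, which matches the repetitions seen with option 3.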


Subject: Parallel Computing Toolbox - Random numbers generation within

From: Peter Perkins

Date: 12 Apr, 2013 14:30:40

Message: 5 of 8

>> However, in general this approach tends to use always the "low"
>> stream/substream (because usually the jobID and taskID are relatively
>> small numbers compared to the max number of streams / substreams).

Why do you care? The proper statistical properties are already built
into the algorithms. Just let that happen.

>> s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');

It's hard to imagine that with 2^63 streams and 2^51 substreams, you
really need to shuffle the seed. In any case:

 >> help randstream.create
  RandStream.create Create multiple independent random number streams.
[snip]
   'NumStreams', 'StreamIndices', and 'Seed' can be used to ensure
   that multiple streams created at different times are independent.
   Streams of the same type and created using the same value for
   'NumStreams' and 'Seed', but with different values of
   'StreamIndices', are independent even if they were created in
   separate calls to RandStream.create.

The converse of that, probably not stated clearly enough (I will make a
note to have that improved), is that if you use different seeds for the
parallel generators, then all bets are off as far as independence goes.
You are perhaps OK, but mrg32k3a was designed to achieve
independence using streams/substreams with the same seed.
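
In code, that might look roughly like the following variant of Option A, with the 'shuffle' seed replaced by a fixed one. The helper name makeTaskStream and the seed value 0 are hypothetical; jobID and taskID stand for job.ID and task.ID as used in the taskStartup.m code above.

```matlab
function s = makeTaskStream(jobID, taskID)
% Hypothetical helper: build the random stream for one task.
% Every call uses the same generator, NumStreams and Seed, so streams
% differ only through their stream index and substream index.
s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, ...
    'StreamIndices', jobID, 'Seed', 0);   % fixed seed (assumed value)
s.Substream = taskID;                     % one substream per task
end

% Possible use inside taskStartup.m (not run here):
%   RandStream.setGlobalStream(makeTaskStream(task.Parent.ID, task.ID));
```

With this arrangement two runs give different numbers only if the job ID differs, so if job IDs can be reused, some other per-run counter would have to serve as the stream index.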


>> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1,
>> > {[3,1],getfield(rng(rng('shuffle')),'Seed')}); %option 2: generate the
>> > seed using "shuffle" at this moment
>> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],[]});
>> %option
>> > 3: let the function generating the seed internally, using shuffle
>> > t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],-1}); %option
>> > 4: let the task using the seed it is supposed to use
>> >
>> > end;
>> >
>> > %% Submit the job to the queue
>> > submit(j);
>> >
>> > %% Wait for the job to complete:
>> > wait(j)
>> >
>> > %% Get results
>> > results = fetchOutputs(j);
>> >
>> > %% Delete the job and permanently remove the job from the scheduler's
>> > storage location
>> > delete(j)
>> >
>> > %% Check the output
>> > %if two columns are equal, it means the corresponding tasks started
>> from
>> > the same %random seed...which is something not wanted!
>> > fprintf('\nIf two columns are equal, this is bad...')
>> > final_data=[results{:}]
>> > if any(diff(sort(final_data(1,:)))==0), %checking the first line is
>> > sufficient in this case
>> > fprintf('\n...there is a generation problem!\n');
>> > else
>> > fprintf('\n...this generation seems to be ok!\n');
>> > end;
>> >
>> > %----------------------------
>> >
>> > %---------------------------
>> > %Additional function
>> >
>> > function out=f_myrand_with_seed(dim,sd)
>> >
>> > if nargin>1 && ~isempty(sd),
>> > if sd>0, %change the seed to the required value
>> > rng(sd);
>> > end; %note that, when sd<0 we do NOTHING
>> > else
>> > rng('shuffle'); %use the clock-based seed
>> > end;
>> > out=rand(dim);
>> > %out=rng;out=out.Seed; %use this line to have the seed from the present
>> > task
>> > %-----------------------------
>> >
>> > thanks for your comments...
>> >
>> > bye,
>> > gabriele

Subject: Parallel Computing Toolbox - Random numbers generation within

From: Gabriele

Date: 13 Apr, 2013 14:05:08

Message: 6 of 8

Dear Peter,
the reason why I need to shuffle the seed is very simple.

Suppose that the code works this way:
1) First it generates one job with two tasks
2) The job is then submitted
3) After completion, the job is deleted

In such a case the job has, for instance, jobID=1, and the taskIDs are 1 and 2.
The stream index is then 1, with substream indices 1 and 2 for the two tasks respectively.

When the job is deleted, the jobID is removed. This means that, if I run the same code again, the new jobID can again be 1, with taskIDs 1 and 2 (again).

In such a situation (which is actually my real situation), if you don't use shuffle, the random number generator will again generate exactly the same random numbers for task 1 and task 2.

As a result, without shuffle, whenever the pair (jobID,taskID) is the same, the same random numbers are generated.

If you work with the local cluster profile, it is common to delete your jobs after completion and after gathering the results. This means that a new set of jobs often starts again from jobID=1. As a result, without shuffle, you very often generate the same random number series.
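
The situation described above can be reproduced outside the toolbox with a minimal sketch. It assumes the default seed; the values of jobID and taskID are illustrative stand-ins for the IDs a real job and task would report, and it uses the documented 'StreamIndices' parameter name.

```matlab
% Assumption: jobID and taskID are illustrative stand-ins for the IDs that
% a real job and task would report.
jobID  = 1;
taskID = 2;

% First "run": stream chosen by jobID, substream by taskID, default seed.
s1 = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
s1.Substream = taskID;
a = rand(s1, 3, 1);

% Second "run" after the job was deleted: the same IDs come back.
s2 = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
s2.Substream = taskID;
b = rand(s2, 3, 1);

% A different task of the same job uses another substream.
s3 = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
s3.Substream = taskID + 1;
c = rand(s3, 3, 1);

sameNumbers   = isequal(a, b);    % same (jobID,taskID) pair, same draws
differentTask = ~isequal(a, c);   % other substream, other draws
```

Here sameNumbers shows the repetition between runs, while differentTask shows that tasks within one job still differ.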

In addition, for the same reason, it is very likely (I would say almost certain) that the stream and substream indices you are going to use will be the low ones (I cannot imagine a relatively standard system where the jobID/taskID have reached orders of magnitude of, e.g., 2^30...). For this reason I was proposing option B, which tends to shift the stream and substream indices towards higher values.

What do you think?

Moreover, any comment on the 2^31-1 matter in the shuffle algorithm in the RandStream class?
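
For reference, a short calculation of what the modulus changes, assuming the shuffle seed really is computed as mod(floor(now*8640000),2^31-1), the expression reported from RandStream.m in this thread. The variable names are illustrative only.

```matlab
% Assumption: the clock is sampled in units of 1/100 s (8640000 ticks per
% day), as in the expression mod(floor(now*8640000), 2^31-1).
ticksPerDay = 8640000;

% Number of days after which the clock-based seed wraps around
% for each modulus.
wrapDays31 = (2^31 - 1) / ticksPerDay;   % modulus found in the class file
wrapDays32 = (2^32 - 1) / ticksPerDay;   % modulus I would have expected

fprintf('Seed wraps after %.2f days with 2^31-1\n', wrapDays31);
fprintf('Seed wraps after %.2f days with 2^32-1\n', wrapDays32);
```

Under that assumption, either modulus leaves the 1/100 s resolution unchanged; only the wrap-around period differs.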

Thanks,
Gabriele

Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message <kk95qg$e76$1@newscl01ah.mathworks.com>...
> >> However, in general this approach tends to use always the "low"
> >> stream/substream (because usually the jobID and taskID are relatively
> >> small numbers compared to the max number of streams / substreams).
>
> Why do you care? The proper statistical properties are already built
> into the algorithms. Just let that happen.
>
> >> s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');
>
> It's hard to imagine that with 2^63 streams and 2^51 substreams, you
> really need to shuffle the seed. In any case:
>
> >> help randstream.create
> RandStream.create Create multiple independent random number streams.
> [snip]
> 'NumStreams', 'StreamIndices', and 'Seed' can be used to ensure
> that multiple streams created at different times are independent.
> Streams of the same type and created using the same value for
> 'NumStreams' and 'Seed', but with different values of
> 'StreamIndices', are independent even if they were created in
> separate calls to RandStream.create.
>
> The converse of that, probably not stated clearly enough (I will make a
> note to have that improved), is that if you use different seeds for the
> parallel generators, then all bets are off as far as independence goes.
> You are perhaps OK, but mrg32k3a was designed to achieve
> independence using streams/substreams with the same seed.
>
>
> On 4/11/2013 10:06 AM, Gabriele wrote:
> > Hi Peter,
> > sorry for the late reply, but I was doing some testing.
> > Thanks for the suggestions.
> >
> > I exchanged some e-mails with MATLAB support, and I received
> > some good suggestions on this point.
> >
> > These suggestions go along your lines, i.e. using streams and substreams.
> >
> > Now I have mainly two possibilities to select from.
> >
> > Option A:
> > Prepare a file taskStartup.m embedding the following code
> >
> > %------------------
> > taskID = task.ID;
> > job = task.Parent;
> > jobID = job.ID;
> >
> > s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', ...
> >     jobID, 'seed', 'shuffle');
> > s.Substream = taskID;
> > RandStream.setGlobalStream(s);
> > %------------------
> >
> > Basically, the stream is selected on the basis of the jobID and the
> > substream is selected on the basis of the taskID. In addition to this,
> > the seed of the stream s is based on the clock time.
> > My feeling is that this should work, because even if the jobID and the taskID
> > are the same for two different calculations (suppose a previous
> > (job,task) was properly deleted), the "shuffle" option should modify
> > the seed, thus leading to different generations.
> >
> > However, in general this approach tends to always use the "low"
> > streams/substreams (because usually the jobID and taskID are relatively
> > small numbers compared to the max number of streams / substreams).
> >
> > An alternative I was thinking of would be as follows:
> >
> > Option B:
> > %-----------------------------------
> >
> > %get task ID, (parent) job ID and job creation time
> > taskID = task.ID;
> > job = task.Parent;
> > jobID = job.ID;
> > job_creation_time=job.CreateTime;
> >
> > %convert the job creation time to seconds
> > job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
> > job_creation_time=round(datenum(job_creation_time,'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
> > %create shift indices using the job creation time:
> > shift_index_stream=job_creation_time; %shift index for stream
> > shift_index_substream=round(job_creation_time/1000); %shift index for substream
> > %Now:
> > %1) Create a large number of independent streams;
> > %2) Select the stream using shift_index_stream and jobID
> > %3) Generate also a random seed using "shuffle"
> > %4) For this particular task use a substream identified by
> > % shift_index_substream and taskID
> > %
> > NS=2^63; %number of multiple independent streams
> > s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', ...
> >     shift_index_stream+jobID,'seed','shuffle');
> > s.Substream = shift_index_substream+taskID;
> > RandStream.setGlobalStream(s);
> > %-----------------------------------
> >
> > The idea is to shift the indices of the stream and
> > substreams. The shift is based on the jobID and the job creation
> > time for the stream, and on the taskID and the job creation time for the
> > substream. On top of this, "shuffle" is used to create a seed which is
> > based on the task startup time (since taskStartup.m is called at the
> > start of the task).
> > This looks like "mixing" things a little bit more (since the indices of
> > the streams/substreams used are more diverse). However, I'm not sure
> > this gives correct statistical properties.
> >
> > So, the question is: considering that both seem to do the job, is it
> > better to use option A or option B?
> >
> > Thanks,
> > Gabriele
> >
> > PS: I have noticed something looking strange to me in the definition of
> > the class RandStream, where the shuffle algorithm is implemented
> > (function seed = shuffleSeed).
> > I see the following :
> > line #733: seed0 = mod(floor(now*8640000),2^31-1);
> > line #735: seed = mod(floor(now*8640000),2^31-1);
> >
> > However, considering the seed can be any number smaller than 2^32, I
> > would have expected:
> > line #733: seed0 = mod(floor(now*8640000),2^32-1);
> > line #735: seed = mod(floor(now*8640000),2^32-1);
> >
> > why is the shuffle seed limited to 2^31-1?
> >
> > Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
> > <kjem6b$e28$1@newscl01ah.mathworks.com>...
> >> Gabriele, you're doing large-scale parallel simulations. You should be
> >> using the right tools for that. Setting seeds based on current time or
> >> whatever is like throwing darts at a dartboard. You need something
> >> more controlled.
> >>
> >> MATLAB includes two random number generators, mrg32k3a and mldfg6331,
> >> that are specifically designed for the kind of thing you're doing.
> >> They both support multiple independent streams and substreams. (the
> >> latter is more or less a lighterweight version of the former). I can't
> >> really follow all of the "topology" that you describe, but by basing
> >> the stream (or substream) index on the tasks, or workers, or runs, you
> >> can ensure that you don't reuse the same random numbers.
> >>
> >> This is described at length in a couple of blog posts:
> >>
> >> http://blogs.mathworks.com/loren/2008/11/05/new-ways-with-random-numbers-part-i
> >>
> >> http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii
> >>
> >>
> >> I hope this is helpful.

Subject: Parallel Computing Toolbox - Random numbers generation within

From: Peter Perkins

Date: 16 Apr, 2013 01:22:51

Message: 7 of 8

If you're doing that many parallel simulations, you should be using the
generator in the way in which it was designed. It has parallelism
designed into it in the form of 2^63 streams, each with 2^51 substreams,
to choose from. The mrg32k3a generator was not designed to be
parallelized via seeds.

Will it matter? Who can say. But the generator was tested to verify that
it gives statistical independence between streams and substreams, not
between different seeds.
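
One way to follow this advice while still getting different numbers on every run is sketched below: keep the seed fixed and pick the stream from a run counter instead of the jobID. The counter (runIndex) is an assumption: the toolbox does not provide it, so you would have to store and increment it yourself between runs. nTasks is likewise illustrative.

```matlab
% Assumptions: runIndex is a hypothetical counter kept by the user between
% runs; nTasks is an illustrative number of tasks per run.
nTasks = 2;
draws  = zeros(3, nTasks, 2);

for runIndex = 1:2                 % two successive runs of the same code
    for taskID = 1:nTasks
        % Same generator, same NumStreams, same fixed seed everywhere;
        % only the stream and substream indices change.
        s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, ...
            'StreamIndices', runIndex, 'Seed', 0);
        s.Substream = taskID;
        draws(:, taskID, runIndex) = rand(s, 3, 1);
    end
end

% One column per (run, task) pair; no two columns should coincide.
cols = reshape(draws, 3, []);
allDistinct = size(unique(cols.', 'rows'), 1) == size(cols, 2);
```

In a real job the RandStream.create call would sit in taskStartup.m, followed by RandStream.setGlobalStream(s), as in the options discussed earlier in the thread.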


>> >> >
>> >> > function out=f_myrand_with_seed(dim,sd)
>> >> >
>> >> > if nargin>1 && ~isempty(sd),
>> >> > if sd>0, %change the seed to the required value
>> >> > rng(sd);
>> >> > end; %note that, when sd<0 we do NOTHING
>> >> > else
>> >> > rng('shuffle'); %use the clock-based seed
>> >> > end;
>> >> > out=rand(dim);
>> >> > %out=rng;out=out.Seed; %use this line to have the seed from the
>> present
>> >> > task
>> >> > %-----------------------------
>> >> >
>> >> > thanks for your comments...
>> >> >
>> >> > bye,
>> >> > gabriele

Subject: Parallel Computing Toolbox - Random numbers generation within

From: Gabriele

Date: 16 Apr, 2013 15:05:07

Message: 8 of 8

Hello Peter,
thanks for your reply, but, reading it, I guess I did not explain my point clearly enough.

However the generator was designed, the point is that in my application (and, I guess, in many others) it is simply ***not acceptable*** for the same sequence of random numbers to be tied to a given pair (jobID,taskID).

This means that it is necessary to find a way out.
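
To make the issue concrete, here is a minimal sketch that runs on the client without any cluster. The values of jobID and taskID are made up and stand in for job.ID and task.ID; the name-value pair is written out as 'StreamIndices', which I assume is the full form of the 'Stream' abbreviation used in the code in this thread. With no shuffle and no other shift, reusing the same pair reproduces the same numbers:

```matlab
% Hypothetical IDs standing in for job.ID and task.ID in taskStartup.m
jobID  = 1;
taskID = 2;

% First "run": stream chosen by jobID, substream by taskID, default seed
sA = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
sA.Substream = taskID;
xA = rand(sA, 3, 1);

% Second "run": the job was deleted, so the same IDs are handed out again
sB = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
sB.Substream = taskID;
xB = rand(sB, 3, 1);

% A different task of the same job uses the next substream
sC = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'StreamIndices', jobID);
sC.Substream = taskID + 1;
xC = rand(sC, 3, 1);

same_pair_repeats = isequal(xA, xB);   % same (jobID,taskID): identical numbers
other_task_equal  = isequal(xA, xC);   % different substream: different numbers
disp([same_pair_repeats, other_task_equal])
```

So the streams/substreams separate the tasks from each other, but on their own they do nothing about repeated runs that reuse the same IDs.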

One option proposed below is to set the stream index from the jobID and the substream index from the taskID, and then also to use "shuffle" for the seed.

If I understood you correctly, you are saying this does not work properly because shuffling the seed does not guarantee independence between the sequences associated with different stream/substream indices.

The other option I proposed is to "shuffle the stream and substream indices", i.e. to set them from the jobID, the taskID and the time the job was created. I proposed using shuffle for the seed on top of that. However, if I understood you correctly, this would not be appropriate either.

Maybe it would then be possible to drop the "shuffle" of the seed but still set the stream and substream indices from the jobID, the taskID and the time the job was created, i.e.:

%get task ID, (parent) job ID and job creation time
taskID = task.ID;
job = task.Parent;
jobID = job.ID;
job_creation_time=job.CreateTime;
%convert the job creation time to seconds
job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
job_creation_time=round(datenum(job_creation_time,'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
%create shift indices using the job creation time:
shift_index_stream=job_creation_time; %shift index for stream
shift_index_substream=round(job_creation_time/1000); %shift index for substream
%Now:
%1) Create a large number of independent streams;
%2) Select the stream using shift_index_stream and jobID
%3) For this particular task use a substream identified by
% shift_index_substream and taskID
%
NS=2^63; %number of multiple independent streams
s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', shift_index_stream+jobID); %note: there is no shuffle
s.Substream = shift_index_substream+taskID;
RandStream.setGlobalStream(s);

This should keep the independence between the sequences in different tasks, and should guarantee that the generated numbers differ from run to run, unless the same jobID and taskID are created again within too short a time difference (which is typically not the case...).
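
As a rough check of the idea, the following sketch can be run on the client without a cluster. The two time values are invented numbers standing in for the converted job.CreateTime (in seconds) of two runs, the IDs are hypothetical, and 'StreamIndices' is assumed to be the full name of the 'Stream' option used above:

```matlab
% Hypothetical IDs, identical in both runs because the first job was deleted
jobID  = 1;
taskID = 2;
NS     = 2^63;                 % number of independent streams, as above

% Made-up job creation times in seconds (second run starts 5000 s later)
t_run1 = 63533000000;
t_run2 = t_run1 + 5000;

% Run 1: indices shifted by the job creation time, no shuffle of the seed
s1 = RandStream.create('mrg32k3a', 'NumStreams', NS, 'StreamIndices', t_run1 + jobID);
s1.Substream = round(t_run1/1000) + taskID;
x1 = rand(s1, 3, 1);

% Run 2: same IDs, later creation time, hence a different stream/substream
s2 = RandStream.create('mrg32k3a', 'NumStreams', NS, 'StreamIndices', t_run2 + jobID);
s2.Substream = round(t_run2/1000) + taskID;
x2 = rand(s2, 3, 1);

% Rebuilding the stream of run 1 from the recorded time reproduces it
s3 = RandStream.create('mrg32k3a', 'NumStreams', NS, 'StreamIndices', t_run1 + jobID);
s3.Substream = round(t_run1/1000) + taskID;
x3 = rand(s3, 3, 1);

runs_differ  = ~isequal(x1, x2);
reproducible = isequal(x1, x3);
disp([runs_differ, reproducible])
```

A side benefit, under these assumptions, is that a run stays reproducible as long as the job creation time is stored together with the results.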

Gabriele
  
Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message <kki95b$mh0$1@newscl01ah.mathworks.com>...
> If you're doing that many parallel simulations, you should be using the
> generator in the way in which it was designed. It has parallelism
> designed into it in the form of 2^53*2^64 substreams to choose from.
> The mrg32k3a generator was not designed to be parallelized via seeds.
>
> Will it matter? Who can say. But the generator was tested to verify that
> it gives statistical independence between streams and substreams, not
> between different seeds.
>
>
> On 4/13/2013 10:05 AM, Gabriele wrote:
> > Dear Peter,
> > the reason why I need to shuffle the seed is very simple.
> >
> > Suppose that the code works this way:
> > 1) First it generates one job with two tasks
> > 2) The job is then submitted
> > 3) After completion, the job is deleted
> >
> > In such a case the job has, for instance, jobID=1, and the taskIDs are 1
> > and 2.
> > The stream index is then 1, with substream indices 1 and 2 for the two
> > tasks respectively.
> >
> > When the job is deleted, the jobID is removed. This means that, if I run
> > the same code again, the new jobID can be, again, 1, with taskID 1 and 2
> > (again).
> >
> > In such a situation (which is actually my real situation), if you don't
> > use a shuffle the random number generator will generate again exactly
> > the same random numbers for task 1 and task 2.
> >
> > As a result, without shuffle, whenever the couple (jobID,taskID) is the
> > same, the same random numbers are generated.
> > If you work with the local cluster profile, it is common you delete your
> > jobs after completion and after gathering results. This means that it is
> > not uncommon you start your new set of jobs from jobID=1. As a result,
> > without using shuffle, it is very common you always generate the same
> > random number series.
> >
> > In addition, for the same reason, it is very common (I would say almost
> > sure) that the stream indices and substream indices you are going to use
> > will be the low ones (I cannot imagine a relatively standard system
> > where the jobID/taskID have reached order of magnitudes of, e.g.,
> > 2^30...). For this reason I was proposing the option B which tends to
> > shift the stream and substream indices towards higher values.
> >
> > What do you think?
> >
> > Moreover, any comment on the 2^31-1 matter in the shuffle algorithm in
> > the RandStream class?
> >
> > Thanks,
> > Gabriele
> >
> > Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
> > <kk95qg$e76$1@newscl01ah.mathworks.com>...
> >> >> However, in general this approach tends to use always the "low"
> >> >> stream/substream (because usually the jobID and taskID are relatively
> >> >> small numbers compared to the max number of streams / substreams).
> >>
> >> Why do you care? The proper statistical properties are already built
> >> into the algorithms. Just let that happen.
> >>
> >> >> s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');
> >>
> >> It's hard to imagine that with 2^63 streams and 2^51 substreams, you
> >> really need to shuffle the seed. In any case:
> >>
> >> >> help randstream.create
> >> RandStream.create Create multiple independent random number streams.
> >> [snip]
> >> 'NumStreams', 'StreamIndices', and 'Seed' can be used to ensure
> >> that multiple streams created at different times are independent.
> >> Streams of the same type and created using the same value for
> >> 'NumStreams' and 'Seed', but with different values of
> >> 'StreamIndices', are independent even if they were created in
> >> separate calls to RandStream.create.
> >>
> >> The converse of that, probably not stated clearly enough (I will make
> >> a note to have that improved), is that if you use different seeds for
> >> the parallel generators, then all bets are off as far as independence
> >> goes. You are perhaps OK, but mrg32k3a was designed to achieve
> >> independence using streams/substreams with the same seed.
> >>
> >>
> >> On 4/11/2013 10:06 AM, Gabriele wrote:
> >> > Hi Peter,
> >> > sorry for the late reply, but I was doing some testing.
> >> > Thanks for the suggestions.
> >> >
> >> > I had some exchange of e-mails with the matlab support, and I received
> >> > some good suggestions on this point.
> >> >
> >> > Those suggestions go along the same lines as yours, i.e. using streams and substreams.
> >> >
> >> > Now I have mainly two possibilities to select from.
> >> >
> >> > Option A:
> >> > Prepare a file taskStartup.m embedding the following code
> >> >
> >> > %------------------
> >> > taskID = task.ID;
> >> > job = task.Parent;
> >> > jobID = job.ID;
> >> >
> >> > s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');
> >> > s.Substream = taskID;
> >> > RandStream.setGlobalStream(s);
> >> > %------------------
> >> >
> >> > Basically, the stream is selected on the basis of the jobID and the
> >> > substream is selected on the basis of the taskID. In addition to this,
> >> > the seed of the stream s is based on the clock time.
> >> > My feeling is that this should work, because even if the jobID and the
> >> > taskID are the same for two different calculations (suppose a previous
> >> > (job,task) was properly deleted), the "shuffle" option should modify
> >> > the seed, thus leading to different generated sequences.
> >> >
> >> > However, in general this approach tends to use always the "low"
> >> > stream/substream (because usually the jobID and taskID are relatively
> >> > small numbers compared to the max number of streams / substreams).
> >> >
> >> > An alternative I was thinking of would be as follows:
> >> >
> >> > Option B:
> >> > %-----------------------------------
> >> >
> >> > %get task ID, (parent) job ID and job creation time
> >> > taskID = task.ID;
> >> > job = task.Parent;
> >> > jobID = job.ID;
> >> > job_creation_time=job.CreateTime;
> >> >
> >> > %convert the job creation time to seconds
> >> > job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
> >> > job_creation_time=round(datenum(job_creation_time,'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
> >> > %create shift indices using the job creation time:
> >> > shift_index_stream=job_creation_time; %shift index for stream
> >> > shift_index_substream=round(job_creation_time/1000); %shift index for substream
> >> > %Now:
> >> > %1) Create a large number of independent streams;
> >> > %2) Select the stream using shift_index_stream and jobID
> >> > %3) Generate also a random seed using "shuffle"
> >> > %4) For this particular task use a substream identified by
> >> > % shift_index_substream and taskID
> >> > %
> >> > NS=2^63; %number of multiple independent streams
> >> > s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', shift_index_stream+jobID,'seed','shuffle');
> >> > s.Substream = shift_index_substream+taskID;
> >> > RandStream.setGlobalStream(s);
> >> > %-----------------------------------
> >> >
> >> > The idea is to shift the indices of the stream and the substream. The
> >> > shift is based on the jobID and the job creation time for the stream,
> >> > and on the taskID and the job creation time for the substream. On top
> >> > of this, "shuffle" is used to create a seed which is based on the task
> >> > startup time (since taskStartup.m is called when the task starts).
> >> > This looks like "mixing" things a little bit more (since the
> >> > stream/substream indices that get used are more diverse). However, I'm
> >> > not sure this gives correct statistical properties.
> >> >
> >> > So, the question is: considering that both seem to do the job, is it
> >> > better to use option A or option B?
> >> >
> >> > Thanks,
> >> > Gabriele
> >> >
> >> > PS: I have noticed something looking strange to me in the definition of
> >> > the class RandStream, where the shuffle algorithm is implemented
> >> > (function seed = shuffleSeed).
> >> > I see the following :
> >> > line #733: seed0 = mod(floor(now*8640000),2^31-1);
> >> > line #735: seed = mod(floor(now*8640000),2^31-1);
> >> >
> >> > However, considering the seed can be any number smaller than 2^32, I
> >> > would have expected:
> >> > line #733: seed0 = mod(floor(now*8640000),2^32-1);
> >> > line #735: seed = mod(floor(now*8640000),2^32-1);
> >> >
> >> > why is the shuffle seed limited to 2^31-1?
> >> >
> >> > Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
> >> > <kjem6b$e28$1@newscl01ah.mathworks.com>...
> >> >> Gabriele, you're doing large-scale parallel simulations. You should be
> >> >> using the right tools for that. Setting seeds based on current time or
> >> >> whatever is like throwing darts at a dartboard. You need something
> >> >> more controlled.
> >> >>
> >> >> MATLAB includes two random number generators, mrg32k3a and mldfg6331,
> >> >> that are specifically designed for the kind of thing you're doing.
> >> >> They both support multiple independent streams and substreams. (the
> >> >> latter is more or less a lighterweight version of the former). I can't
> >> >> really follow all of the "topology" that you describe, but by basing
> >> >> the stream (or substream) index on the tasks, or workers, or runs, you
> >> >> can ensure that you don't reuse the same random numbers.
> >> >>
> >> >> This is described at length in a couple of blog posts:
> >> >>
> >> >>
> >> >> http://blogs.mathworks.com/loren/2008/11/05/new-ways-with-random-numbers-part-i
> >> >>
> >> >> http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii
> >> >>
> >> >> I hope this is helpful.
> >> >>
> >> >>
> >> >>
> >> >>
