Thread Subject: Distributed computing - lost connection to worker

Subject: Distributed computing - lost connection to worker

From: Steffen

Date: 1 Jul, 2009 07:48:01

Message: 1 of 7

Hi,

my cluster has run pretty well so far, but since I'm using colleagues' ordinary desktop PCs, it happens from time to time that parfor shuts down the session because one or more labs go missing (someone restarts a PC, etc.).
Is there any way to avoid this complete breakdown, e.g. by having the remaining workers finish the missing worker's share? At the moment everything is lost if a single worker drops out.

Currently, I'm starting everything like:
matlabpool close force %stop all processes
matlabpool open Parallel_Config %restart pool
parfor ii=1:500
for jj=1:400
[dummy]=func_cluster(handles);
end
end

Many thanks in advance and best regards,

Steffen

Subject: Distributed computing - lost connection to worker

From: Raymond Norris

Date: 1 Jul, 2009 12:44:02

Message: 2 of 7

Steffen,

When you open a MATLAB Pool, you're setting up an MPI ring, regardless of whether you execute a parfor or spmd. This requires that all labs (Workers) are up and running. If lab #1 was sending a message to lab #2, but lab #2 was down, your code would hang. Therefore, as a precaution, any time one lab goes down, we shut down the ring (which is why your session ends prematurely).

For a task-parallel application, there are two design patterns to use. One is parfor, which is the most seamless. The second is creating jobs and tasks, which does not convert a for to a parfor or use a MATLAB Pool. Its behavior is what you describe: there is no MPI ring, and therefore, if a Worker goes down, the other Workers can continue with the remaining tasks.

An example might be:

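j = createJob('FileDependencies','func_cluster.m');
for tidx = 1:500
   % one independent task per iteration; 1 = number of output arguments
   j.createTask(@func_cluster,1,{handles(tidx)});
end
j.submit
j.wait
dummy = j.getAllOutputArguments();   % one row of outputs per task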

Raymond

Subject: Distributed computing - lost connection to worker

From: Steffen

Date: 2 Jul, 2009 08:24:02

Message: 3 of 7

Dear Raymond,

thanks for the hint. For simple cases I got your method to work. However, I can't get it to work with my own code, which is a bit more complex. ;)

At present (with parfor) each worker calculates an entire row of a data set and returns both cell arrays and double arrays with the results for each column of that row. How do I pass double arrays into createTask and, most importantly, how do I get cell and double arrays out? Is it possible to calculate row-wise and then stitch the results back together into the full (i,j) matrix, along the lines of your example code?
To make sure I get my point across, here is my current parfor code, simplified:

parfor i=1:size(series,2)
   B_rows=zeros(1,size(series,1));
   I_rows=cell(1,size(series,1));
   for j=1:size(series,1)
      I=squeeze(series(j,i,:));
      [B_rows(j), I_rows{j}] = func_cluster(handles, I, i, j);
   end
B(:,i)=B_rows;
I(:,i)=I_rows;
end

Many thanks for the help. Much appreciated!!!

Subject: Distributed computing - lost connection to worker

From: Raymond Norris

Date: 2 Jul, 2009 13:10:02

Message: 4 of 7

Steffen,

Putting in dummy data, as a for loop, I couldn't get your code to work. I can prototype a job/task implementation, but perhaps you could flesh this out a bit more for me.

1. series is perhaps some 3-D matrix?
2. I don't think I care about handles, as it's only used in func_cluster()
3. Can you write a simple function for func_cluster that returns dummy data? I just need to see what it's assigning to B_rows and I_rows.

Thanks,
Raymond

Subject: Distributed computing - lost connection to worker

From: Steffen

Date: 2 Jul, 2009 16:06:01

Message: 5 of 7

Hi Raymond,

thanks for taking care of this. Very much appreciated!!
The script below should work; it is a very simplified version of what I need to do, and I'd like to convert it so that it works with createTask. Ideally, I'd like to assign one row of series with 10 columns to one worker. Afterwards I need to know which B and A belong to which pixel (i,j). Normally the series is quite large (512x512x500)...

series=rand(1)*100*ones(10,10,10);
handles.name='test';
A=cell(10,10);
parfor i=1:size(series,2)
   B_rows=zeros(1,size(series,1));
   I_rows=cell(1,size(series,1));
   for j=1:size(series,1)
      I=double(squeeze(series(j,i,:)));
      [B_rows(j), I_rows{j}] = func_cluster(handles, I, i, j);
   end
   B(:,i)=B_rows;
   A(:,i)=I_rows;
end

function [B,I2]=func_cluster(handles, I, i, j)
B=mean(mean(I))*i*j;
I2=I*i;

Thanks...

Subject: Distributed computing - lost connection to worker

From: Raymond Norris

Date: 2 Jul, 2009 20:05:17

Message: 6 of 7

Hi Steffen,

As I first stated, using parfor is much more seamless, but here's one approach. Pull the majority of the work out of the outer for loop and save it as a function, unit_of_work.m. unit_of_work() will return two output arguments and take three input arguments. If 'series' gets to be quite large, you could assign it to the job data (instead of passing it to each task) and then query for it in unit_of_work (I'll explain below in my comments).

%%%%%%%%%% Top level function %%%%%%%%%%
series = rand*100*ones(10,10,10);
handles.name = 'test';

j = createJob('FileDependencies',{'unit_of_work.m','func_cluster.m'});

for i=1:size(series,2)
   j.createTask(@unit_of_work,2,{series,handles,i});
end
j.submit, j.wait
out = j.getAllOutputArguments();
B = [out{:,1}];
A = [out{:,2}];
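
To see why these two concatenations recover the (j,i) layout, here is a small sketch that needs no cluster. It fakes the cell array that getAllOutputArguments returns (one row per task, one column per output argument). The names nTasks and sz and the filled-in values are assumptions made up for illustration only.

```matlab
% Mock of the N-by-2 cell array returned by getAllOutputArguments:
% row t holds the outputs of task t, column 1 = B_rows, column 2 = I_rows.
nTasks = 3;   % hypothetical number of tasks (columns of series)
sz     = 4;   % hypothetical number of rows in series
out = cell(nTasks, 2);
for t = 1:nTasks
    out{t,1} = t * (1:sz)';          % sz-by-1 double, stands in for B_rows
    out{t,2} = repmat({t}, sz, 1);   % sz-by-1 cell, stands in for I_rows
end

% Horizontal concatenation puts each task's column vector side by side.
B = [out{:,1}];   % sz-by-nTasks double; B(j,t) came from task t, row j
A = [out{:,2}];   % sz-by-nTasks cell;   A{j,t} came from task t, row j
```

Because each task returns column vectors, column t of B and A always belongs to task t, so the pixel index (j,i) is preserved regardless of which Worker ran the task.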

To pass series around as job data, add
   j.JobData = series;
after you create the job, and change the task creation to
   j.createTask(@unit_of_work,2,{handles,i});

Here's what unit_of_work.m looks like

%%%%%%%%%% unit_of_work.m %%%%%%%%%%
function [B_rows, I_rows] = unit_of_work(series, handles, i)

sz = size(series,1);
B_rows = zeros(sz,1);
I_rows = cell(sz,1);
for j = 1:sz
   I = double(squeeze(series(j,i,:)));
   [B_rows(j), I_rows{j}] = func_cluster(handles, I, i, j);
end

end
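
Before submitting to the cluster, it can help to run unit_of_work serially and confirm that it reproduces the nested-loop result. The following is only a sketch under assumptions: the small 4x3x5 test size and the names nCols, Bref and isSame are made up for illustration, the func_cluster body is the dummy one posted earlier in this thread, and local functions at the end of a script require R2016b or later.

```matlab
% Serial check: call unit_of_work column by column, with no scheduler,
% and compare against the plain nested-loop computation.
series = rand*100*ones(4,3,5);   % small hypothetical test size
handles.name = 'test';

nCols = size(series,2);
out = cell(nCols, 2);            % same shape getAllOutputArguments would give
for i = 1:nCols
    [out{i,1}, out{i,2}] = unit_of_work(series, handles, i);
end
B = [out{:,1}];
A = [out{:,2}];

% Reference result from the plain double loop
Bref = zeros(size(series,1), nCols);
for i = 1:nCols
    for j = 1:size(series,1)
        I = double(squeeze(series(j,i,:)));
        Bref(j,i) = func_cluster(handles, I, i, j);
    end
end
isSame = isequal(B, Bref);       % true if the task layout matches the loop

function [B_rows, I_rows] = unit_of_work(series, handles, i)
sz = size(series,1);
B_rows = zeros(sz,1);
I_rows = cell(sz,1);
for j = 1:sz
    I = double(squeeze(series(j,i,:)));
    [B_rows(j), I_rows{j}] = func_cluster(handles, I, i, j);
end
end

function [B, I2] = func_cluster(handles, I, i, j)
% Dummy stand-in for the real computation (taken from this thread)
B = mean(mean(I))*i*j;
I2 = I*i;
end
```

If isSame comes out true, the only remaining difference between this loop and the job/task version is where each call to unit_of_work runs.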

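If you pass series around as job data, then series is no longer passed into unit_of_work (its signature becomes unit_of_work(handles, i), matching the modified createTask call above). Instead, add to the top of the function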

function ...

 j = getCurrentJob();
 series = j.JobData;

 sz = ...

Hope this helps.
Raymond


Subject: Distributed computing - lost connection to worker

From: Steffen

Date: 9 Jul, 2009 12:02:02

Message: 7 of 7

Hi Raymond,

sorry for taking so long to reply. It took me quite a while to rewrite all my code, but it finally works just fine!!

Thanks a lot for the help!
