Thread Subject: parfor error message

Subject: parfor error message

From: Mr. CFD

Date: 6 Nov, 2009 09:31:02

Message: 1 of 7

Hi,
I have a simulation running on a cluster using the parfor command. The simulation has previously run successfully with no problems, but today it was terminated mid-way with the following error:

"The session that parfor is using has shut down"

Upon further inspection, I have traced the parfor statement in the code that the error refers to, but I'm at a complete loss to explain the cause. Since the code has run successfully on previous occasions, I have no idea how to deal with this problem. What can cause the parfor command to ‘shut down’? Any advice, please.

Thanks

Subject: parfor error message

From: Edric M Ellis

Date: 6 Nov, 2009 13:28:54

Message: 2 of 7

"Mr. CFD" <s2108860@student.rmit.edu.au> writes:

> I have a simulation running on a cluster using the parfor command. The
> simulation has previously run successfully with no problems, but today it was
> terminated mid-way with the following error:
>
> "The session that parfor is using has shut down"
>
> Upon further inspection, I have traced the parfor statement within the code
> which the error is referring to. I'm at a complete loss to explain the cause
> of this error. Especially, since the code has been used successfully in
> previous occasions, I have no idea how to deal with this problem. What can
> cause the parfor command to ‘shut-down’? Any advice please.

Are you using an interactive MATLABPOOL (i.e. calling "matlabpool open ..." in
your desktop MATLAB session)?

That error message literally means that the pool has been closed unexpectedly -
the connection to the workers simply disappeared. This could happen if a worker
crashed for example.

What sort of cluster are you running on? (Many clusters have various resource
usage limits after which they terminate jobs - maybe you're hitting one of
those?)

Cheers,

Edric.

Subject: parfor error message

From: Mr. CFD

Date: 9 Nov, 2009 13:33:01

Message: 3 of 7

Hi Edric,
The job is run on an external supercomputing cluster. It uses parfor loops and is submitted with 'createMatlabPoolJob', so the iterations of each parfor loop are spread across 'n' CPUs to accelerate computation.
I have run this same simulation on many occasions with no problems, so it is strange that this error should appear now, and I find it difficult to explain. Do you have any suggestions as to what could be happening and how it can be avoided, please?
Thanks

Subject: parfor error message

From: Edric M Ellis

Date: 9 Nov, 2009 14:25:04

Message: 4 of 7

"Mr. CFD" <s2108860@student.rmit.edu.au> writes:

> Hi Edric, The job is run on an external supercomputing cluster. Parfor
> commands are used and the 'createMatlabPoolJob' scheduler is applied. So each
> index of the parfor command is simulated by 'n' CPUs to accelerate
> computation. I have run this same simulation on many occasions with no
> problems, so for this error to appear now is strange and difficult to
> explain. Do you have any suggestions of what could be happening and how it
> can be avoided please?

Ok, thanks for the info. If precisely the same matlabpool job previously used to
run correctly, and now reliably fails, that might indicate that something has
changed on your cluster. My first suspicion would be that limits on resource
usage may have changed - e.g. the walltime or memory limit for jobs.

Do you know if the job fails at exactly the same point each time? Do you know
how long (in terms of elapsed time) through your job things fall over? Can you
run a smaller problem on the cluster successfully? (E.g. can you reduce the data
size / number of iterations and have it work?)

One other thing to look out for - are there any matlab_crash_dump.* files
created in your home directory when the job runs? Is there anything in the
"debug log" of the job? (How you get to that depends on the scheduler).

Cheers,

Edric.
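
[Editor's sketch] The crash-dump check suggested above could be scripted roughly as follows. This is only a minimal sketch: it assumes a Unix-like cluster where the HOME environment variable points at the home directory in which MATLAB writes its matlab_crash_dump.* files, and the variable names (homeDir, dumps) are illustrative.

```matlab
% Sketch: list MATLAB crash dump files in the home directory, newest first.
% Assumption: HOME is set and the dumps are written there.
homeDir = getenv('HOME');
dumps = dir(fullfile(homeDir, 'matlab_crash_dump.*'));

if isempty(dumps)
    fprintf('No crash dump files found in %s\n', homeDir);
else
    % Sort by modification time so the most recent crash is listed first.
    [unused, order] = sort([dumps.datenum], 'descend'); %#ok<ASGLU>
    dumps = dumps(order);
    for k = 1:numel(dumps)
        fprintf('%s  %s  (%d bytes)\n', ...
                dumps(k).date, dumps(k).name, dumps(k).bytes);
    end
end
```

Comparing the timestamps of any dump files against the time the job failed is one way to tell whether a worker crash coincided with the parfor shutdown.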

Subject: parfor error message

From: Mr. CFD

Date: 10 Nov, 2009 13:07:01

Message: 5 of 7

Hi Edric,
Thanks for the feedback. You did address some important issues.

I can't tell you the exact elapsed time when the job fails, but it happens during the later stages of the design iteration process (somewhere between the 500th and 600th iteration out of 800). This point is usually reached at around the 6-7 hour mark, given that the entire process should take around 11 hours. Exceeding the walltime limit therefore shouldn't be an issue, since the cluster allows a maximum job time of 168 hours! I have added tic/toc calls so that I can record the time and hopefully get a better indication of when the error occurs and whether it falls within a consistent window each time. As I mentioned, I have run this simulation successfully on previous occasions, so reproducing the error to check for consistency is a tedious process. The cluster administrator informs me that there have been no obvious changes to the resource limits for MATLAB jobs.

Since I first posted this query, I have run the simulation twice with no problems, so diagnosing the source of the error is very difficult. I have a series of try-catch statements within the code itself to record any errors in the code that may be causing this fault. In this particular case, the message "The session that parfor is using has shut down" was reported and no crash dump files were generated.

From my experience with this code, I have noticed the following errors, which appear at random.
a) Sudden termination of the simulation. For these errors a series of crash dump files is generated, as you mention. The error logs from the try-catch statements show no problems within the code itself, so I'm assuming there could be an issue with the cluster. I have submitted these files to the technical support unit and am waiting for their feedback.

b) The most common error I have noticed, and one that most definitely indicates an issue within the code itself, is:

Error in ==> mysimulation>(parfor body factory) at 183
Undefined variable "out" or class "out.x".

The line 183, in simple terms is defined as follows:
parfor ii=1:n
[a,b,c]=dosomething(myoutputs.x)
....
....
end

This error is the most frustrating! Can you please advise how it can be fixed? It also appears at random, so I’m finding it hard to pin down.

c) The "parfor shutting down" error initially reported in this post was a complete surprise, and of the many simulations that have been run, it has only surfaced once.

I'm still going through some of the points you covered in your previous post regarding the parfor issue. In the meantime, can you please provide feedback on point b)?

Thanks
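
[Editor's sketch] One way to record when a run falls over, along the lines of the tic/toc timing described above, is to append a timestamped line to a log file after every design iteration. This is a hedged sketch only: the log file name, the small iteration count and the pause placeholder are all hypothetical stand-ins, not the poster's actual code.

```matlab
% Sketch: log elapsed time after each design iteration so that the last
% line in the log shows how far the run got before it failed.
logFile = fullfile(tempdir, 'mysimulation_progress.log');  % hypothetical path
if exist(logFile, 'file') == 2
    delete(logFile);          % start each run with a fresh log
end

nIter  = 5;                   % stand-in for the 800 design iterations
tStart = tic;                 % timer handle for the whole run
for iter = 1:nIter
    % ... the parfor work for this design iteration would go here ...
    pause(0.01);              % placeholder for real work

    % Open, write and close on every iteration so the entry reaches disk
    % even if the session shuts down during the next iteration.
    fid = fopen(logFile, 'a');
    fprintf(fid, '%s  iteration %d of %d  elapsed %.1f s\n', ...
            datestr(now), iter, nIter, toc(tStart));
    fclose(fid);
end
```

Opening and closing the file each time costs a little, but it avoids losing buffered output when the job is killed.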

Subject: parfor error message

From: Edric M Ellis

Date: 10 Nov, 2009 14:17:33

Message: 6 of 7

"Mr. CFD" <s2108860@student.rmit.edu.au> writes:

> b) The most common error which I have noticed and this most definitely indicates
> an issue within the code itself is:
>
> Error in ==> mysimulation>(parfor body factory) at 183
> Undefined variable "out" or class "out.x".
>
> The line 183, in simple terms is defined as follows:
> parfor ii=1:n
> [a,b,c]=dosomething(myoutputs.x)
> ....
> ....
> end
>
> This error is the most frustrating! Can you please advise how this is fixed?
> Also this error appears at random, therefore I’m finding it hard to get a
> fix on this.

Is the "myoutputs" structure defined inside or outside the parfor loop?

I'm struggling to see how the error refers to "out" when the code refers to
"myoutputs". I'm also somewhat confused as to how that error message could show
up only sometimes. I'm even more confused why an "undefined variable" message
could appear sporadically. The only variability that one might expect when
running a PARFOR loop is in the way the loop iterations are divided among the
workers.

I wonder if perhaps the workers are running out of memory, and that's causing
weirdness. Is there any facility to track worker memory usage while you're
running this stuff? (Resource exhaustion just might also explain the hard
crashes as well as the strange errors that you're seeing).

Cheers,

Edric.
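
[Editor's sketch] Worker memory usage could be tracked from inside the loop itself, roughly as below. This is a sketch under stated assumptions: the workers run Linux (it reads /proc/self/status, and returns NaN anywhere that file is missing), and the function names trackWorkerMemory and workerMemoryKB are hypothetical helpers, not toolbox functions.

```matlab
function memKB = trackWorkerMemory(n)
% TRACKWORKERMEMORY  Sketch: record resident memory after each iteration.
%   memKB(ii) is the resident set size (kB) of whichever MATLAB process
%   executed iteration ii, sampled at the end of that iteration.
memKB = zeros(1, n);              % sliced output, one entry per iteration
parfor ii = 1:n
    % ... the real work for iteration ii would go here ...
    memKB(ii) = workerMemoryKB();
end
end

function kb = workerMemoryKB()
% Read the VmRSS line from /proc/self/status (Linux only).
kb = NaN;
statusFile = '/proc/self/status';
if exist(statusFile, 'file') == 2
    txt = fileread(statusFile);
    tok = regexp(txt, 'VmRSS:\s+(\d+)\s+kB', 'tokens', 'once');
    if ~isempty(tok)
        kb = str2double(tok{1});
    end
end
end
```

Calling memKB = trackWorkerMemory(8) and looking at max(memKB) gives a rough peak figure. A value that climbs steadily across iterations would be consistent with the resource-exhaustion suspicion, although this sketch does not identify which worker produced each sample.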

Subject: parfor error message

From: Mr. CFD

Date: 11 Nov, 2009 07:52:04

Message: 7 of 7

Hi Edric,
First of all many thanks for your feedback. I appreciate your thoughts on this rather annoying error. I have been digging around to get some more info which could explain why we have this issue:

Error in ==> mysimulation>(parfor body factory) at 183
Undefined variable "out" or class "out.x".

The line 183, in simple terms is defined as follows:
parfor ii=1:n
[a,b,c]=dosomething(myoutputs.x(ii,:))
...
....
end

> Is the "myoutputs" structure defined inside or outside the parfor loop?
myoutputs.x is defined outside the parfor loop

The errstack (8 by 1 struct array) from the catch statement provides some vital information:
erroutputs.errstack(1,1):
file: '/usr/local/matlab/R2008b/toolbox/matlab/lang/parallel_function.m'
name: 'parallel_function'
line: 587

erroutputs.errstack(2,1):
Reports the actual error [Error in ==> mysimulation>(parfor body factory) at 183]

erroutputs.errstack(3,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/private/dctEvaluateFunction.m'
name: 'iEvaluateWithNoErrors'
line: 21

erroutputs.errstack(4,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/private/dctEvaluateFunction.m'
name: 'dctEvaluateFunction'
line: 7

erroutputs.errstack(5,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/private/dctEvaluateTask.m'
name: 'iEvaluateTask'
line: 95

erroutputs.errstack(6,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/private/dctEvaluateTask.m'
name: 'dctEvaluateTask'
line: 18

erroutputs.errstack(7,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/distcomp_evaluate_filetask.m'
name: 'iDoTask'
line: 106

Interrogating this error further, we can see that line 106 falls within the following code:
=====================================================
try
    % If dctEvaluateTask throws an error then something went wrong in DCT
    % code not user code - and we need to exit the worker, not continue
    [output, errOutput, textOutput] = dctEvaluateTask(job, task, runprop);
    % Package up the output into a structure to pass around easily
    out = struct('output', {output}, 'errOutput', {errOutput}, 'textOutput', {textOutput});
catch e
    handlers.errorFcn(e, 'Unexpected error while evaluating task - MATLAB will now exit.');
end
=====================================================
Here we see the statement "If dctEvaluateTask throws an error then something went wrong in DCT code not user code - and we need to exit the worker, not continue"
This could further support your initial guess; maybe an issue in the cluster itself!

erroutputs.errstack(8,1):
file: '/usr/local/matlab/R2008b/toolbox/distcomp/distcomp_evaluate_filetask.m'
name: 'distcomp_evaluate_filetask'
line: 32

There is a lot of information here. I'm also confused as to why the error refers to "out" when the code refers to "myoutputs", and why the "undefined variable" message would appear at random.

I have been in touch with the administrators running the cluster. The memory used for this failed simulation was within the allocated resources, so they don’t feel memory resources could be a problem. In any case, I can track worker memory usage for future tasks, but haven’t re-started the simulation, since we don’t have a fix on this error as yet.

Hope this information will provide some answers.

Thanks
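
[Editor's sketch] An error stack like the one listed above can be captured from the MException object in a catch block. The following is a minimal sketch, not the poster's actual code: runAndRecordError and the field names of errOutputs are hypothetical, chosen to mirror the erroutputs.errstack naming used in the post.

```matlab
function errOutputs = runAndRecordError(fcn)
% RUNANDRECORDERROR  Sketch: run fcn and capture error details on failure.
%   errOutputs.errstack holds the MException stack, one element per frame,
%   each with the fields file, name and line.
errOutputs = struct('identifier', '', 'message', '', 'errstack', []);
try
    fcn();
catch err
    errOutputs.identifier = err.identifier;
    errOutputs.message    = err.message;
    errOutputs.errstack   = err.stack;
    % Print every frame, innermost first, for the job's text output.
    for k = 1:numel(err.stack)
        fprintf('%d: %s (%s) line %d\n', k, err.stack(k).name, ...
                err.stack(k).file, err.stack(k).line);
    end
end
end
```

For example, info = runAndRecordError(@() error('demo:fail', 'Demo failure')) returns a struct whose identifier field is 'demo:fail'. Saving the identifier alongside the stack makes it easier to tell a toolbox-level failure from one raised in user code.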
