Path: news.mathworks.com!not-for-mail
From: Edric M Ellis <eellis@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: parfor error message
Date: Mon, 09 Nov 2009 14:25:04 +0000
Organization: The Mathworks, Ltd.
Message-ID: <ytwbpjbepxb.fsf@uk-eellis-deb5-64.mathworks.co.uk>
References: <hd0qcm$168$1@fred.mathworks.com> <ytwaayzg4tl.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hd95md$3ta$1@fred.mathworks.com>


"Mr. CFD" <s2108860@student.rmit.edu.au> writes:

> Edric M Ellis <eellis@mathworks.com> wrote in message <ytwaayzg4tl.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
>> "Mr. CFD" <s2108860@student.rmit.edu.au> writes:
>> 
>> > I have a simulation running on a cluster using the parfor command. The
>> > simulation has previously run successfully with no problems, but today it was
>> > terminated mid-way with the following error:
>> >
>> > "The session that parfor is using has shut down"
>> >
>> > Upon further inspection, I have traced the parfor statement within the code
>> > which the error is referring to. I'm at a complete loss to explain the cause
>> > of this error. Since the code has run successfully on previous occasions,
>> > I have no idea how to deal with this problem. What can cause the parfor
>> > command to 'shut-down'? Any advice, please.
>> 
>> Are you using an interactive MATLABPOOL (i.e. calling "matlabpool open ..." in
>> your desktop MATLAB session)?
>> 
>> That error message literally means that the pool has been closed unexpectedly -
>> the connection to the workers simply disappeared. This could happen if a worker
>> crashed for example.
>> 
>> What sort of cluster are you running on? (Many clusters have various resource
>> usage limits after which they terminate jobs - maybe you're hitting one of
>> those?)
>> 
>> Cheers,
>> 
>> Edric.
>
> Hi Edric, The job is run on an external supercomputing cluster. Parfor
> commands are used and the 'createMatlabPoolJob' scheduler is applied. So each
> index of the parfor command is simulated by 'n' CPUs to accelerate
> computation. I have run this same simulation on many occasions with no
> problems, so for this error to appear now is strange and difficult to
> explain. Do you have any suggestions as to what could be happening and how
> it can be avoided, please?

Ok, thanks for the info. If precisely the same matlabpool job previously used to
run correctly, and now reliably fails, that might indicate that something has
changed on your cluster. My first suspicion would be that limits on resource
usage may have changed - e.g. the walltime or memory limit for jobs.

Do you know if the job fails at exactly the same point each time? How far into
the job, in terms of elapsed time, do things fall over? Can you run a smaller
problem on the cluster successfully? (E.g. can you reduce the data size or the
number of iterations and have it work?)
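
As a rough sketch of how you might gather that information, the function below
runs a scaled-down loop, prints a timestamp as each iteration finishes, and
reports the total elapsed time on the client. The names timedParforCheck and
simulateOne are placeholders of mine, not part of your code or of any toolbox;
substitute your own per-iteration work. The timestamps come from each worker's
own clock, so they are only comparable if the cluster nodes' clocks agree.

```matlab
function results = timedParforCheck(nIter)
% timedParforCheck  Hypothetical harness for a scaled-down parfor run.
%   nIter is the (reduced) number of iterations to try.

    tStart = tic;                    % wall-clock timer, read on the client only
    results = zeros(1, nIter);

    parfor k = 1:nIter
        results(k) = simulateOne(k); % placeholder for the real per-iteration work
        % Worker-side timestamp; this ends up in the job's output, so the last
        % line printed shows roughly when and where things stopped.
        fprintf('iteration %d finished at %s\n', k, datestr(now));
    end

    fprintf('all %d iterations done in %.1f s\n', nIter, toc(tStart));
end

function y = simulateOne(k)
% Placeholder computation standing in for one step of the real simulation.
    y = sum(sqrt(1:k));
end
```

Calling timedParforCheck(10), then progressively larger values, should show
whether the failure tracks elapsed time (which would point at a walltime
limit) or problem size (which would point at memory).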

One other thing to look out for - are there any matlab_crash_dump.* files
created in your home directory when the job runs? Is there anything in the
"debug log" of the job? (How you get to that depends on the scheduler).
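
To check for crash dumps from within MATLAB, something along these lines should
do. It is only a sketch: it assumes a Unix-like system where the HOME
environment variable is set, and that the home directory you are looking at is
the one the workers see (on some clusters each node has its own).

```matlab
% List any matlab_crash_dump.* files in the home directory.
homeDir = getenv('HOME');   % assumption: HOME is set, as on most Unix-like clusters
dumps = dir(fullfile(homeDir, 'matlab_crash_dump.*'));

if isempty(dumps)
    fprintf('No matlab_crash_dump.* files found in %s\n', homeDir);
else
    for k = 1:numel(dumps)
        % name, date and bytes are standard fields of the struct returned by dir
        fprintf('%s  (%s, %d bytes)\n', dumps(k).name, dumps(k).date, dumps(k).bytes);
    end
end
```

If any of the files have a date matching a failed run, the stack trace near
the top of the file usually says which component crashed.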

Cheers,

Edric.