Path: news.mathworks.com!not-for-mail
From: "Rafael " <rafael.fritz@physik.uni-marburg.de>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Parallel configuration validation in SGE env
Date: Wed, 4 Nov 2009 12:05:04 +0000 (UTC)
Organization: Universität Marburg
Lines: 42
Message-ID: <hcrqlg$pjd$1@fred.mathworks.com>
References: <hcrjih$ep5$1@fred.mathworks.com> <ytwvdhqfude.fsf@uk-eellis-deb5-64.mathworks.co.uk>


Edric M Ellis <eellis@mathworks.com> wrote in message <ytwvdhqfude.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
> "Rafael " <rafael.fritz@physik.uni-marburg.de> writes:
> 
> > I've been able to find the right configurations for the parallel Matlab
> > computing on our linux cluster, using the sun grid engine SGE scheduler.  Now,
> > if I try to validate these configurations the findResource part passes and
> > also the parallel part and the matlabpool.  But there always occurs a failure
> > with distributed jobs!  Why?  Thanks very much for your suggestions!
> 
> Very strange - usually if parallel and matlabpool jobs are working, that's the
> hardest part. Is there any output from the validation that you could post here?

Thanks, Edric, for your suggestions.
The validation output for the distributed job does not look very unusual:
"Stage: Distributed Job
Status: Failed
Description:  The given stage reached the default or user-specified timeout.
Command Line Output:
Submitting task 1
Job output will be written to: /home/fritzra/matlab/hello_test_files/Job13_Task1.out
QSUB output: Your job 1858603 ("Job13.1") has been submitted
Error Report: (none)
Debug Log: (none)"

In the scheduler I can see which node the job was sent to and look for the process there. MATLAB starts on that node and runs until it reaches the maximum run time I configured in sgeSubmitFcn.m.
The parallel and matlabpool jobs look similar, but they pass the configuration validation.
In the PCTB docs I found a note that this distributed stage fails with an mpiexec scheduler. I don't know what that is...
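
Since the MATLAB process on the node just runs until the time limit, it may help to look at the task states from the client side instead of waiting for the validation timeout. This is only a rough sketch, not a tested recipe: reportTaskStates and timeoutSeconds are names made up for illustration, and j is assumed to be a distributed job that was already created and submitted on the SGE scheduler object.

```matlab
function reportTaskStates(j, timeoutSeconds)
% reportTaskStates  Hypothetical helper (not part of the toolbox).
% Waits a bounded time for job j, then prints the state and any
% error message of each task. Assumes j was already submitted.

% Give up after timeoutSeconds instead of blocking forever.
waitForState(j, 'finished', timeoutSeconds);
fprintf('Job state: %s\n', get(j, 'State'));

tasks = get(j, 'Tasks');
for k = 1:numel(tasks)
    t = tasks(k);
    fprintf('Task %d: state = %s\n', get(t, 'ID'), get(t, 'State'));
    msg = get(t, 'ErrorMessage');
    if ~isempty(msg)
        fprintf('  error: %s\n', msg);
    end
end
end
```

A task that stays 'pending' or 'running' until the timeout would match what I see on the node; the Job*_Task*.out file named in the validation output is the other place to look.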

> Or perhaps you could try something like this:
> 
> s = findResource( .... ); % get your scheduler
> j = s.createJob;
> j.createTask( @matlabroot, 1 );
> j.createTask( @matlabroot, 1 );
> j.submit;
> j.wait(); s.getDebugLog( j )
> 
> and post the output.

I will do so in about 2 hours!
Thanks very much so far,
Rafael