Path: news.mathworks.com!not-for-mail
From: Edric M Ellis <eellis@mathworks.com>
Newsgroups: comp.soft-sys.matlab
Subject: Re: Parallel configuration validation in SGE env
Date: Thu, 12 Nov 2009 11:47:36 +0000
Organization: The Mathworks, Ltd.
Lines: 56
Message-ID: <ytw639gc6cn.fsf@uk-eellis-deb5-64.mathworks.co.uk>
References: <hcrjih$ep5$1@fred.mathworks.com> <ytwvdhqfude.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hcrsl6$8d9$1@fred.mathworks.com> <ytwr5sefgln.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hcuglu$iq4$1@fred.mathworks.com> <ytwiqdpf7dl.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hcun9r$c1h$1@fred.mathworks.com> <ytweiocf1nk.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hdefi4$c27$1@fred.mathworks.com> <ytwaaytcaq4.fsf@uk-eellis-deb5-64.mathworks.co.uk> <hdgp1a$i9p$1@fred.mathworks.com>
NNTP-Posting-Host: uk-eellis-deb5-64.mathworks.co.uk
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: fred.mathworks.com 1258026456 18394 172.16.27.232 (12 Nov 2009 11:47:36 GMT)
X-Complaints-To: news@mathworks.com
NNTP-Posting-Date: Thu, 12 Nov 2009 11:47:36 +0000 (UTC)
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
Cancel-Lock: sha1:+regGcAN0hIId61Tburh34a7W4I=
Xref: news.mathworks.com comp.soft-sys.matlab:584502


"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

>> Yes, ssh/rsh onto one of the cluster nodes, and run something like
>> 
>> /path/to/matlab/bin/worker
>> 
>> That should print stuff out and then exit with an error message (because it
>> doesn't know what job to execute - that's what I'd expect)
>
> Great, I think here is some useful information about the decode function!
> I got the following error after starting /local/matlab/bin/worker on a computing node:
> -------------------------------------
> < M A T L A B (R) > [.... usual startup stuff ...]
> /home/fritzra
>
> Error converting the environement variable MDCE_DECODE_FUNCTION to a function handle.
> This is probably because the environement variable (MDCE_DECODE_FUNCTION) does not exist.
> The MDCE_DECODE_FUNCTION variable's current value is ""
> Error returned was:
> Invalid function name ''
> Killed
> --------------------------------
> So I thinks that's very good to know and I should check the setting of the
> variable MDCE_DECODE_FUNCTION in the sgeSubmitFcn.m, right?!?

Actually, that error message is normal when "worker" is run that way. I was
just trying to make sure that nothing mysterious on the cluster was stopping
"worker" from launching at all, and it looks like nothing is.

Unless you've changed something, the sgeSubmitFcn.m should already arrange for
the right MDCE_DECODE_FUNCTION to be set. There should be a line in there saying
something like

setenv('MDCE_DECODE_FUNCTION', 'sgeDecodeFunc'); 

and hopefully sgeDecodeFunc is on the default path of the workers. (This part of
the puzzle is the same as for parallel jobs, so it's hard to understand why they
work and distributed jobs do not.)
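
To check quickly whether that line is still present, something like the
following might help. It is only a sketch: check_decode_setenv is a made-up
helper name, and the path in the usage comment is a placeholder for wherever
your copy of sgeSubmitFcn.m lives.

```shell
# Hypothetical helper (not part of MATLAB or SGE): print every line of the
# given file that mentions MDCE_DECODE_FUNCTION, then "found" or "missing".
check_decode_setenv() {
    submit_fcn="$1"
    if grep -n "MDCE_DECODE_FUNCTION" "$submit_fcn"; then
        echo "found"
    else
        echo "missing"
    fi
}

# Usage (placeholder path):
# check_decode_setenv /path/to/sgeSubmitFcn.m
```

If it reports "missing", the submit function has probably been edited and the
setenv call would need to be put back.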

I'm not that familiar with SGE, but I do know that for parallel jobs there's a
"parallel environment" argument (-pe matlab N) which changes some aspects of
how the job is run.

Perhaps we should take another direction. Could you try executing something
simple like this in a shell:

export MDCE_MATLAB_EXE=/path/to/matlab/bin/worker
qsub -N testJob -j yes -o /path/to/logfile -v MDCE_MATLAB_EXE /path/to/sgeWrapper.sh

and see whether that succeeds? That should be basically the same as what
sgeSubmitFcn.m is doing.
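
Once the job has left the queue, the log file is the thing to look at. Here is
a rough sketch of a check, where check_job_log is a made-up helper name and the
log path is the same placeholder as in the qsub line. I'm assuming (but haven't
verified) that the worker under SGE would print the same MDCE_DECODE_FUNCTION
message you saw when running it by hand, since this test doesn't set that
variable either; if so, seeing that message in the log would suggest the
worker did at least start.

```shell
# Hypothetical helper (not part of MATLAB or SGE): report whether the job
# log exists and whether it mentions MDCE_DECODE_FUNCTION.
check_job_log() {
    logfile="$1"
    if [ ! -f "$logfile" ]; then
        echo "no log file"
    elif grep -q "MDCE_DECODE_FUNCTION" "$logfile"; then
        echo "log mentions MDCE_DECODE_FUNCTION"
    else
        echo "log does not mention MDCE_DECODE_FUNCTION"
    fi
}

# Usage (placeholder path; qstat shows whether the job is still queued):
# qstat -u "$USER"
# check_job_log /path/to/logfile
```

"no log file" after the job has finished would point at SGE or the wrapper
script rather than at MATLAB.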

Cheers,

Edric.