Thread Subject: Parallel configuration validation in SGE env

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 4 Nov, 2009 10:04:01

Message: 1 of 23

Hi all,
I've managed to find the right configuration for parallel MATLAB computing on our Linux cluster, which uses the Sun Grid Engine (SGE) scheduler.
Now, when I validate this configuration, the findResource stage passes, and so do the parallel job and matlabpool stages.
But the distributed job stage always fails!
Why?
Thanks very much for your suggestions!

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 4 Nov, 2009 10:37:49

Message: 2 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

> I've been able to find the right configurations for the parallel Matlab
> computing on our linux cluster, using the sun grid engine SGE scheduler. Now,
> if I try to validate these configurations the findResource part passes and
> also the parallel part and the matlabpool. But there always occurs a failure
> with distributed jobs! Why? Thanks very much for your suggestions!

Very strange - usually if parallel and matlabpool jobs are working, that's the
hardest part. Is there any output from the validation that you could post here?
Or perhaps you could try something like this:

s = findResource( .... ); % get your scheduler
j = s.createJob;
j.createTask( @matlabroot, 1 );
j.createTask( @matlabroot, 1 );
j.submit;
j.wait(); s.getDebugLog( j )

and post the output.

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 4 Nov, 2009 12:05:04

Message: 3 of 23

Edric M Ellis <eellis@mathworks.com> wrote in message <ytwvdhqfude.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
> "Rafael " <rafael.fritz@physik.uni-marburg.de> writes:
>
> > I've been able to find the right configurations for the parallel Matlab
> > computing on our linux cluster, using the sun grid engine SGE scheduler. Now,
> > if I try to validate these configurations the findResource part passes and
> > also the parallel part and the matlabpool. But there always occurs a failure
> > with distributed jobs! Why? Thanks very much for your suggestions!
>
> Very strange - usually if parallel and matlabpool jobs are working, that's the
> hardest part. Is there any output from the validation that you could post here?

Thanks, Edric, for your suggestions.
The validation output for the distributed job doesn't look very unusual:
"Stage: Distributed Job
Status: Failed
Description: The given stage reached the default or user-specified timeout.
Command Line Output:
Submitting task 1
Job output will be written to: /home/fritzra/matlab/hello_test_files/Job13_Task1.out
QSUB output: Your job 1858603 ("Job13.1") has been submitted
Error Report: (none)
Debug Log: (none)"

Using the scheduler I can see which node the job went to and look for the process there. MATLAB starts on that node and runs for the maximum time I configured in sgeSubmitFcn.m.
The parallel and matlabpool jobs look similar, but they pass the configuration validation.
In the PCTB docs I found a note that the distributed stage fails with an mpiexec scheduler. I don't know what that is...

> Or perhaps you could try something like this:
>
> s = findResource( .... ); % get your scheduler
> j = s.createJob;
> j.createTask( @matlabroot, 1 );
> j.createTask( @matlabroot, 1 );
> j.submit;
> j.wait(); s.getDebugLog( j )
>
> and post the output.

I will do so in about 2 hours!
Thanks very much so far,
Rafael

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 4 Nov, 2009 12:39:02

Message: 4 of 23

Edric M Ellis <eellis@mathworks.com> wrote in message <ytwvdhqfude.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
> "Rafael " <rafael.fritz@physik.uni-marburg.de> writes:
>
> > I've been able to find the right configurations for the parallel Matlab
> > computing on our linux cluster, using the sun grid engine SGE scheduler. Now,
> > if I try to validate these configurations the findResource part passes and
> > also the parallel part and the matlabpool. But there always occurs a failure
> > with distributed jobs! Why? Thanks very much for your suggestions!
>
> Very strange - usually if parallel and matlabpool jobs are working, that's the
> hardest part. Is there any output from the validation that you could post here?
> Or perhaps you could try something like this:
>
> s = findResource( .... ); % get your scheduler
> j = s.createJob;
> j.createTask( @matlabroot, 1 );
> j.createTask( @matlabroot, 1 );
> j.submit;
> j.wait(); s.getDebugLog( j )
>
> and post the output.

So, I tried this code and saw the jobs being submitted and MATLAB starting on the worker nodes. But there was no workload for the whole run time I configured (20 min).
Then I interrupted it with Ctrl+C and looked for the log file, but didn't find one.
Here is some output:

"Submitting task 1
Job output will be written to: /home/fritzra/matlab/hello_test_files/Job16_Task1.out
QSUB output: Your job 1858617 ("Job16.1") has been submitted
Submitting task 2
Job output will be written to: /home/fritzra/matlab/hello_test_files/Job16_Task2.out
QSUB output: Your job 1858618 ("Job16.2") has been submitted
??? Error using ==> distcomp.abstractjob.wait at 45
>> s.getDebugLog( j )
??? No appropriate method, property, or field getDebugLog for class
distcomp.genericscheduler."

I don't know what is happening.
Why are there Matlab worker sessions starting without actually working?
How do I get the debugLog?
Thanks, Rafael

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 4 Nov, 2009 15:35:16

Message: 5 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

>> Very strange - usually if parallel and matlabpool jobs are working, that's the
>> hardest part. Is there any output from the validation that you could post here?
>> Or perhaps you could try something like this:
>>
>> s = findResource( .... ); % get your scheduler
>> j = s.createJob;
>> j.createTask( @matlabroot, 1 );
>> j.createTask( @matlabroot, 1 );
>> j.submit;
>> j.wait(); s.getDebugLog( j )
>>
>> and post the output.
>
> So, I've tried this code and found submission and start of Matlab at the
> working nodes. But there has been no workload for the whole running time I
> configured (20min). Then I interrupted with strg+c and looked for the logfile
> but didn't get one. Here some output:
>
> "Submitting task 1
> Job output will be written to: /home/fritzra/matlab/hello_test_files/Job16_Task1.out
> QSUB output: Your job 1858617 ("Job16.1") has been submitted
> Submitting task 2
> Job output will be written to: /home/fritzra/matlab/hello_test_files/Job16_Task2.out
> QSUB output: Your job 1858618 ("Job16.2") has been submitted
> ??? Error using ==> distcomp.abstractjob.wait at 45
>>> s.getDebugLog( j )
> ??? No appropriate method, property, or field getDebugLog for class
> distcomp.genericscheduler."

Sorry, my mistake - I forgot that the generic scheduler doesn't have a
getDebugLog method - all it would do is print the contents of the output files,
like these:

/home/fritzra/matlab/hello_test_files/Job16_Task1.out

Is there anything interesting in there?
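
As a rough sketch of what "print the contents of the output files" could look like from a shell: the function below dumps every per-task output file in a job directory. The name show_task_logs is made up for this example, and the Job*_Task*.out pattern is only inferred from the file names in the output above, so adjust it if your submit function names the files differently.

```shell
#!/bin/sh
# Print every per-task output file found in a job directory.
# show_task_logs is a hypothetical helper; the Job*_Task*.out pattern is an
# assumption based on the file names quoted in this thread.
show_task_logs() {
    log_dir="$1"
    found=0
    for f in "$log_dir"/Job*_Task*.out; do
        # If nothing matches, the shell leaves the pattern unexpanded; skip it.
        [ -f "$f" ] || continue
        found=1
        echo "===== $f ====="
        cat "$f"
    done
    if [ "$found" -eq 0 ]; then
        echo "No task output files in $log_dir" >&2
        return 1
    fi
}
```

For example: show_task_logs /home/fritzra/matlab/hello_test_files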

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 5 Nov, 2009 12:33:02

Message: 6 of 23

Edric M Ellis <eellis@mathworks.com> wrote in message <ytwr5sefgln.fsf@uk-eellis-deb5-64.mathworks.co.uk>...

> Sorry, my mistake - I forgot that the generic scheduler doesn't have a
> getDebugLog method - all it would do is print the contents of the output files,
> like these:
>
> /home/fritzra/matlab/hello_test_files/Job16_Task1.out
>
> Is there anything interesting in there?
>
> Cheers,
>
> Edric.

Let's see - in Job16_Task1.out one finds just the following:
"Executing: /local/matlab/bin/worker "
That's always the case after starting distributed jobs.
There is never any other output, unlike parallel jobs, where I get something like Job15.mpiexec.out with content like "starting smpd on hosts ..." and so on.
The worker being executed is just the unchanged worker script supplied by MathWorks.
So, nothing really interesting in this output... ?!?

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 5 Nov, 2009 13:06:46

Message: 7 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

> Edric M Ellis <eellis@mathworks.com> wrote in message <ytwr5sefgln.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
>
>> Sorry, my mistake - I forgot that the generic scheduler doesn't have a
>> getDebugLog method - all it would do is print the contents of the output files,
>> like these:
>>
>> /home/fritzra/matlab/hello_test_files/Job16_Task1.out
>>
>> Is there anything interesting in there?
>>
>> Cheers,
>>
>> Edric.
>
> Let's see - in Job16_Task1.out one finds just the following:
>
> "Executing: /local/matlab/bin/worker "
>
> Thats always after starting distributed jobs. But never any other output like
> it is the case for parallel jobs where I get something like Job15.mpiexec.out
> with content like "starting smpd on hosts ..." and so on. This executed
> worker is just the unchanged worker script given by mathworks. So, not really
> interesting content in this output... ?!?

That's really strange. I would expect to see at least the MATLAB startup banner
text and so on, even if there was something else going wrong. I assume that
"/local/matlab/bin/worker" is the right location on the cluster (otherwise
presumably the parallel stuff wouldn't work).

Is there any chance you could work out which node on the cluster your
distributed job is being scheduled onto, and try running
"/local/matlab/bin/worker" there? It won't do anything terribly useful, but would
at least confirm that MATLAB can start up there... (You could add a "hostname"
command on the line before the "exec" in sgeWrapper.sh to find out where the job
is running).
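
To illustrate the "hostname" suggestion, here is a minimal sketch of what the end of a wrapper script could look like. It is not the sgeWrapper.sh shipped with the product; launch_worker is a hypothetical name, and the sketch assumes that the environment variable MDCE_MATLAB_EXE holds the path to the worker script.

```shell
#!/bin/sh
# Sketch only: the real sgeWrapper.sh may look different.
# launch_worker is a hypothetical helper name used for this illustration.
launch_worker() {
    if [ -z "${MDCE_MATLAB_EXE:-}" ]; then
        echo "MDCE_MATLAB_EXE is not set" >&2
        return 2
    fi
    # Record which node the scheduler picked, so you know where to look later.
    echo "Running on host: $(hostname 2>/dev/null || uname -n)"
    echo "Executing: ${MDCE_MATLAB_EXE}"
    # A real wrapper would typically use exec here; a plain call lets this
    # function hand the worker's exit status back to the caller.
    "${MDCE_MATLAB_EXE}"
}
```

The host name then shows up in the task's output file right before the "Executing:" line.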

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 5 Nov, 2009 14:26:03

Message: 8 of 23

Edric M Ellis <eellis@mathworks.com> wrote in message <ytwiqdpf7dl.fsf@uk-eellis-deb5-64.mathworks.co.uk>...

> That's really strange. I would expect to see at least the MATLAB startup banner
> text and so on, even if there was something else going wrong. I assume that
> "/local/matlab/bin/worker" is the right location on the cluster (otherwise
> presumably the parallel stuff wouldn't work).
>
> Is there any chance you could work out which node on the cluster your
> distributed job is being scheduled onto and trying to run
> "/local/matlab/bin/worker" there? It wont do anything terribly useful, but would
> at least confirm that MATLAB can start up there... (You could add a "hostname"
> command to the line before the "exec" in sgeWrapper.sh to find out where the job
> is running).

I checked that using "qstat" to see which node in the cluster my job was sent to. I can then log in to that worker node using ssh and check the running processes, and there I find a MATLAB process that has started but is not really doing any work. That process runs for the whole 20 minutes which I configured in the submit command.
So it's starting up, but not doing anything.
Or at least not doing what it should do.

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 6 Nov, 2009 09:22:39

Message: 9 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

> Edric M Ellis <eellis@mathworks.com> wrote in message <ytwiqdpf7dl.fsf@uk-eellis-deb5-64.mathworks.co.uk>...
>
>> That's really strange. I would expect to see at least the MATLAB startup banner
>> text and so on, even if there was something else going wrong. I assume that
>> "/local/matlab/bin/worker" is the right location on the cluster (otherwise
>> presumably the parallel stuff wouldn't work).
>>
>> Is there any chance you could work out which node on the cluster your
>> distributed job is being scheduled onto and trying to run
>> "/local/matlab/bin/worker" there? It wont do anything terribly useful, but would
>> at least confirm that MATLAB can start up there... (You could add a "hostname"
>> command to the line before the "exec" in sgeWrapper.sh to find out where the job
>> is running).
>
> I did check that using "qstat" to look where my job is distributed to in the
> cluster. I can then go to this working node using ssh and check the running
> processes and there I find a Matlab process started, but not really doing
> work. That process runs for the whole time of 20 minutes which I've previously
> configured in the submit command. So its starting up, but not doing anything.
> Or at least not doing what it should do.

That's very strange - normally, pretty early on in MATLAB startup, stuff gets
printed to stdout, so I'd expect to see that ending up somewhere. Did you try
launching the "worker" manually on the cluster node to see what happens?

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 11 Nov, 2009 13:52:04

Message: 10 of 23

> That's very strange - normally, pretty early on in MATLAB startup, stuff gets
> printed to stdout, so I'd expect to see that ending up somewhere. Did you try
> launching the "worker" manually on the cluster node to see what happens?
>
> Cheers,
>
> Edric.

Hi and thanks for staying tuned...
Sorry, but:
what do you mean by "launching the worker manually on the cluster node"?
Should I log on there, and if so, what should I run?
Regards, Rafael

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 11 Nov, 2009 16:00:51

Message: 11 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

>> That's very strange - normally, pretty early on in MATLAB startup, stuff gets
>> printed to stdout, so I'd expect to see that ending up somewhere. Did you try
>> launching the "worker" manually on the cluster node to see what happens?
>>
>> Cheers,
>>
>> Edric.
>
> Hi and thanks for staying tuned...
> Sorry, but:
> what do you mean by "launching the worker manually on the cluster node"?
> Should I log on there and do what?

Yes, ssh/rsh onto one of the cluster nodes, and run something like

/path/to/matlab/bin/worker

That should print stuff out and then exit with an error message (because it
doesn't know what job to execute - that's what I'd expect)

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 12 Nov, 2009 10:46:02

Message: 12 of 23

> Yes, ssh/rsh onto one of the cluster nodes, and run something like
>
> /path/to/matlab/bin/worker
>
> That should print stuff out and then exit with an error message (because it
> doesn't know what job to execute - that's what I'd expect)

Great, I think here is some useful information about the decode function!
I got the following error after starting /local/matlab/bin/worker on a computing node:
-------------------------------------
< M A T L A B (R) >
                  Copyright 1984-2009 The MathWorks, Inc.
                Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
                              August 12, 2009

Warning: Name is nonexistent or not a directory:
/local/matlab-7.9/help/toolbox/comm/examples.
Warning: Name is nonexistent or not a directory:
/local/matlab-7.9/help/toolbox/commblks/examples.
Warning: Name is nonexistent or not a directory:
/local/matlab-7.9/help/toolbox/dspblks/examples.
Warning: Name is nonexistent or not a directory:
/local/matlab-7.9/help/toolbox/vipblks/examples.

  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.


/home/fritzra

Error converting the environement variable MDCE_DECODE_FUNCTION to a function handle.
This is probably because the environement variable (MDCE_DECODE_FUNCTION) does not exist.
The MDCE_DECODE_FUNCTION variable's current value is ""
Error returned was:
Invalid function name ''
Killed
--------------------------------
So I think that's very good to know, and I should check how the variable MDCE_DECODE_FUNCTION is set in sgeSubmitFcn.m, right?!?

Regards, Rafael

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 12 Nov, 2009 11:11:03

Message: 13 of 23

> So I thinks that's very good to know and I should check the setting of the variable MDCE_DECODE_FUNCTION in the sgeSubmitFcn.m, right?!?
>
> Regards, Rafael

No, that does not make sense, because I did not use any sgeSubmitFcn.m in this case, did I ?!?

Rafael

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 12 Nov, 2009 11:47:36

Message: 14 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

>> Yes, ssh/rsh onto one of the cluster nodes, and run something like
>>
>> /path/to/matlab/bin/worker
>>
>> That should print stuff out and then exit with an error message (because it
>> doesn't know what job to execute - that's what I'd expect)
>
> Great, I think here is some useful information about the decode function!
> I got the following error after starting /local/matlab/bin/worker on a computing node:
> -------------------------------------
> < M A T L A B (R) > [.... usual startup stuff ...]
> /home/fritzra
>
> Error converting the environement variable MDCE_DECODE_FUNCTION to a function handle.
> This is probably because the environement variable (MDCE_DECODE_FUNCTION) does not exist.
> The MDCE_DECODE_FUNCTION variable's current value is ""
> Error returned was:
> Invalid function name ''
> Killed
> --------------------------------
> So I thinks that's very good to know and I should check the setting of the
> variable MDCE_DECODE_FUNCTION in the sgeSubmitFcn.m, right?!?

Actually, that error message is normal for running "worker" in that way. I was
just trying to make sure that there wasn't something mysterious going on on the
cluster that stopped "worker" from launching at all, which it looks like there
isn't.

Unless you've changed something, the sgeSubmitFcn.m should already arrange for
the right MDCE_DECODE_FUNCTION to be set. There should be a line in there saying
something like

setenv('MDCE_DECODE_FUNCTION', 'sgeDecodeFunc');

and hopefully sgeDecodeFunc is on the default path of the workers. (This part of
the puzzle is the same as for parallel jobs, so it's hard to understand why they
work and distributed jobs do not).

I'm not that familiar with SGE, but I do know that for parallel jobs there's a
"parallel environment" argument (-pe matlab N) which modifies some stuff to do
with the job.

Perhaps we should take another direction: perhaps you could try executing
something simple like this in a shell to see what happens:

export MDCE_MATLAB_EXE=/path/to/matlab/bin/worker
qsub -N testJob -j yes -o /path/to/logfile -v MDCE_MATLAB_EXE /path/to/sgeWrapper.sh

and see whether that succeeds? That should be basically the same as what
sgeSubmitFcn.m is doing.
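
If it helps, the same test can be wrapped in a small shell function that first checks that both paths are executable and, by default, only prints the qsub line instead of submitting it. This is a sketch: submit_test_job and DRY_RUN are names invented here, and the check only covers the machine you run it on, so a path that exists on the head node could still be missing on a compute node.

```shell
#!/bin/sh
# Sketch: sanity-check the paths, then print (or run) the qsub line from above.
# submit_test_job and DRY_RUN are hypothetical names for this example.
submit_test_job() {
    worker="$1"
    wrapper="$2"
    logfile="$3"
    for p in "$worker" "$wrapper"; do
        if [ ! -x "$p" ]; then
            echo "Not executable on this host: $p" >&2
            return 1
        fi
    done
    MDCE_MATLAB_EXE="$worker"
    export MDCE_MATLAB_EXE
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show what would be submitted.
        echo "qsub -N testJob -j yes -o $logfile -v MDCE_MATLAB_EXE $wrapper"
    else
        qsub -N testJob -j yes -o "$logfile" -v MDCE_MATLAB_EXE "$wrapper"
    fi
}
```

Calling it with DRY_RUN=0 set in the environment performs the actual submission.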

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 12 Nov, 2009 16:15:18

Message: 15 of 23

> Actually, that error message is normal for running "worker" in that way. I was
> just trying to make sure that there wasn't something mysterious going on on the
> cluster that stopped "worker" from launching at all, which it looks like there
> isn't.
>
> Unless you've changed something, the sgeSubmitFcn.m should already arrange for
> the right MDCE_DECODE_FUNCTION to be set. There should be a line in there saying
> something like
>
> setenv('MDCE_DECODE_FUNCTION', 'sgeDecodeFunc');
>
> and hopefully sgeDecodeFunc is on the default path of the workers. (This part of
> the puzzle is the same as for parallel jobs, so it's hard to understand why they
> work and distributed jobs do not).
>
> I'm not that familiar with SGE, but I do know that for parallel jobs theres a
> "parallel environment" argument (-pe matlab N) which modifies some stuff to do
> with the job.
>
> Perhaps we should take another direction: perhaps you could try executing
> something simple like this in a shell to see what happens:
>
> export MDCE_MATLAB_EXE=/path/to/matlab/bin/worker
> qsub -N testJob -j yes -o /path/to/logfile -v MDCE_MATLAB_EXE /path/to/sgeWrapper.sh

Hi,
I did so: I set the EXE variable and submitted sgeWrapper.sh using the parameters you described. An output file was generated, but it is empty. Sorry.
Rafael

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 12 Nov, 2009 16:34:33

Message: 16 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

>> Perhaps we should take another direction: perhaps you could try executing
>> something simple like this in a shell to see what happens:
>>
>> export MDCE_MATLAB_EXE=/path/to/matlab/bin/worker
>> qsub -N testJob -j yes -o /path/to/logfile -v MDCE_MATLAB_EXE /path/to/sgeWrapper.sh
>
> Hi, I did so: set the EXE variable and submitted sgeWrapper.sh using the
> described parameters. There has been generated an output-file, but it's just
> empty. Sorry. Rafael

Well, that would appear to be the problem - somehow, when sgeWrapper.sh is
launched on the cluster, the worker doesn't get launched. Could you try creating
a shell script like this:

##############################################################################
#!/bin/sh
#$ -S /bin/sh
#$ -v MDCE_MATLAB_EXE

echo "Here's the environment:"
env
echo "Here's what happens when we launch the worker:"
${MDCE_MATLAB_EXE}

##############################################################################

and then submitting it using qsub as before.

I'm surprised that not even the "Executing: " line from sgeWrapper.sh is being
printed to your output file...
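
A possible variation on the script above, in case the worker fails silently: run the command with stderr merged into stdout and print its exit status afterwards. run_and_report is a hypothetical helper name; in the job script it would be called as run_and_report "${MDCE_MATLAB_EXE}".

```shell
#!/bin/sh
# Sketch: launch a command, keep its error messages in the same log,
# and report the exit status. run_and_report is a hypothetical name.
run_and_report() {
    cmd="$1"
    echo "Here's what happens when we launch: $cmd"
    status=0
    # Merge stderr into stdout so error messages land in the same file.
    "$cmd" 2>&1 || status=$?
    echo "Exit status: $status"
    return "$status"
}
```

In POSIX shells an exit status of 127 means the command was not found and 126 means it was found but could not be executed, which would at least narrow things down.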

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 13 Nov, 2009 11:05:04

Message: 17 of 23

> ##############################################################################
> #!/bin/sh
> #$ -S /bin/sh
> #$ -v MDCE_MATLAB_EXE
>
> echo "Here's the environment:"
> env
> echo "Here's what happens when we launch the worker:"
> ${MDCE_MATLAB_EXE}
>
> ##############################################################################
>
> and then submitting it using qsub as before.
>
> I'm surprised that not even the "Executing: " line from sgeWrapper.sh is being
> printed to your output file...

So am I...
But sorry, there is still no output in the generated file.
I have written to my cluster admin to ask for suggestions.
This simple script should produce output in any case.
The commands in it do produce output if I log in to the nodes and type them manually.
Regards, Rafael

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 13 Nov, 2009 12:16:02

Message: 18 of 23

Hi,
with some help from my cluster admin I got output into a log file, as you can see below.
It has some information about the environment, but no sign of the worker starting at the end...
>>>>>>>>>>>>>>
Here's the environment:
MDCE_MATLAB_EXE=/local/matlab/bin/worker
HOSTNAME=node012
SGE_TASK_STEPSIZE=undefined
SHELL=/bin/sh
NHOSTS=1
SGE_O_WORKDIR=/home/fritzra
TMPDIR=/scratch/1864162.1.serial_test
SSH_CLIENT=172.26.6.240 38108 22
SGE_O_HOME=/home/fritzra
SGE_ARCH=lx24-amd64
SGE_CELL=default
MPICH_PROCESS_GROUP=no
RESTARTED=0
ARC=lx24-amd64
USER=fritzra
SGE_TASK_LAST=undefined
QUEUE=serial_test
SGE_TASK_ID=undefined
SGE_BINARY_PATH=/usr/local/sge/bin/lx24-amd64
MAIL=/var/mail/root
PATH=/scratch/1864162.1.serial_test:/usr/local/bin:/bin:/usr/bin
SGE_STDERR_PATH=/home/fritzra/manuellerTest.e1864162
PWD=/home/fritzra
SGE_STDOUT_PATH=/home/fritzra/manuellerTest.o1864162
SGE_ACCOUNT=sge
JOB_SCRIPT=/var/spool/sge/node012/job_scripts/1864162
JOB_NAME=manuellerTest
SGE_ROOT=/usr/local/sge
REQNAME=manuellerTest
P4_RSHCOMMAND=rsh
SGE_JOB_SPOOL_DIR=/var/spool/sge/node012/active_jobs/1864162.1
ENVIRONMENT=BATCH
SHLVL=3
HOME=/home/fritzra
SGE_CWD_PATH=/home/fritzra
NQUEUES=1
SGE_O_LOGNAME=fritzra
SGE_O_MAIL=/var/mail/fritzra
TMP=/scratch/1864162.1.serial_test
JOB_ID=1864162
LOGNAME=fritzra
SSH_CONNECTION=172.26.6.240 38108 172.26.6.12 22
SGE_TASK_FIRST=undefined
SGE_O_PATH=/opt/intel/fce/11.0.081/bin/intel64:/opt/intel/cce/11.0.081/bin/intel64:/usr/pgi/linux86-64/8.0/bin:/usr/local/sge/bin/lx24-amd64:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
SGE_O_HOST=marc-hn
SGE_O_SHELL=/bin/bash
REQUEST=manuellerTest
NSLOTS=1
SGE_STDIN_PATH=/dev/null
_=/usr/bin/env
Here's what happens when we launch the worker:

>>>>>>>>>>>>>>>>>>>>>>
I did not cut any output here...

Subject: Parallel configuration validation in SGE env

From: Edric M Ellis

Date: 13 Nov, 2009 14:20:46

Message: 19 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> writes:

> Hi,
> with some help of my cluster admin I got output into a logfile as you can see below.
> Some information about the environment, but no worker start at the far end...
>>>>>>>>>>>>>>>
> Here's the environment:
> [.... environment variables snipped ....]
> Here's what happens when we launch the worker:
>
>>>>>>>>>>>>>>>>>>>>>>>
> I did not cut any output here...

Hmm, I'm afraid I'm all out of ideas here. I can't see anything there that might
prevent the worker from starting up. I think at this stage you might be best
contacting our install support group, as they have more experience with getting
things working on SGE.

Cheers,

Edric.

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 15 Nov, 2009 13:12:01

Message: 20 of 23

...
> > Here's what happens when we launch the worker:
> >
> >>>>>>>>>>>>>>>>>>>>>>>
> > I did not cut any output here...
>
> Hmm, I'm afraid I'm all out of ideas here. I can't see anything there that might
> prevent the worker from starting up. I think at this stage you might be best
> contacting our install support group, as they have more experience with getting
> things working on SGE.
>
> Cheers,
>
> Edric.

OK. That's no problem.
Thanks very much for all your help!
If we get this running at any time, I will come back and add it here.
Regards, Rafael

Subject: Parallel configuration validation in SGE env

From: Marcin

Date: 17 Nov, 2009 21:30:18

Message: 21 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> wrote in message <hcrjih$ep5$1@fred.mathworks.com>...
> Hi all,
> I've been able to find the right configurations for the parallel Matlab computing on our linux cluster, using the sun grid engine SGE scheduler.
> Now, if I try to validate these configurations the findResource part passes and also the parallel part and the matlabpool.
> But there always occurs a failure with distributed jobs!
> Why?
> Thanks very much for your suggestions!

Hi guys,

I have a similar problem. I'm trying to run MATLAB on SGE, my configuration passes the validation but when I issue "matlabpool" I am only able to connect to a single lab. When I try matlabpool 2 etc. I always get an error and I'm running out of ideas... Did you manage to solve your problem?

Subject: Parallel configuration validation in SGE env

From: Rafael

Date: 18 Nov, 2009 08:54:18

Message: 22 of 23

> Hi guys,
>
> I have a similar problem. I'm trying to run MATLAB on SGE, my configuration passes the validation but when I issue "matlabpool" I am only able to connect to a single lab. When I try matlabpool 2 etc. I always get an error and I'm running out of ideas... Did you manage to solve your problem?

Hey,
my problem with distributed jobs still exists, but I'm not working on it very actively, because matlabpool and parallel jobs are working here.
Why doesn't your matlabpool work?
Maybe you could paste your error message here so that someone can have a look at it?
Regards, Rafael

Subject: Parallel configuration validation in SGE env

From: Marcin

Date: 18 Nov, 2009 10:41:18

Message: 23 of 23

"Rafael " <rafael.fritz@physik.uni-marburg.de> wrote in message <he0cnq$ja1$1@fred.mathworks.com>...
> > Hi guys,
> >
> > I have a similar problem. I'm trying to run MATLAB on SGE, my configuration passes the validation but when I issue "matlabpool" I am only able to connect to a single lab. When I try matlabpool 2 etc. I always get an error and I'm running out of ideas... Did you manage to solve your problem?
>
> Hey,
> my problem using distributed jobs still exists, but I am not that active in solving it, because there is matlabpool and parallel jobs working here.
> Why your matlabpool does not work?
> May be you could get your error message and paste it here so one can have a look at it?
> Ragards, Rafael

Hi,

Well it looks like this. When I issue "matlabpool 1", everything is fine. When I issue "matlabpool 2", I get the following output in matlab on the client machine:

Starting matlabpool using the 'SGE-smart@dec120' configuration ...
Your job 1664 ("Job1.1") has been submitted
Your job 1665 ("Job1.2") has been submitted

and it gets stuck there.

Now, when I run qstat on the head node, it says that Job1.1 is running all the time (which is good), but Job1.2 runs for a moment and then finishes. When I look at the log files for both tasks (see below), there is an error for Task2, but I have no idea what it means. When I try "matlabpool 3" etc., it's always the first task that seems to be fine, and the same error appears for all the rest. It doesn't depend on the node executing the task (the same node works fine if it gets Task1 but fails if it gets Task2, 3 etc.). To make things even more complicated, my configuration passes the validation procedure without problems, although at the matlabpool stage I get "Connected to 1 lab" instead of 15.


---------------- Task1 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

                            < M A T L A B (R) >
                  Copyright 1984-2009 The MathWorks, Inc.
                Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
                              August 12, 2009


  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task1"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
About to add job dependencies
About to call jobStartup
About to call taskStartup
About to get evaluation data
About to pInstantiatePool
Pool instatiation complete
About to call poolStartup
Begin task function
End task function

---------------- Task2 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

                            < M A T L A B (R) >
                  Copyright 1984-2009 The MathWorks, Inc.
                Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
                              August 12, 2009


  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task2"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
Unexpected error in PreTaskEvaluate - MATLAB will now exit.
No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

Error in ==> dctEvaluateTask at 40
    task.pPreTaskEvaluate;

Error in ==> distcomp_evaluate_filetask>iDoTask at 96
    dctEvaluateTask(postFcns, finishFcn);

Error in ==> distcomp_evaluate_filetask at 38
    iDoTask(handlers, postFcns);
