Thread Subject: Error message distributed computing

Subject: Error message distributed computing

From: Steffen

Date: 13 Jul, 2009 08:00:17

Message: 1 of 8

Hello,

a weird error message has been coming up recently, and what makes it so frustrating and tough to analyze is that it seems to occur at random.
I've created about 300 tasks (they take about 20 hrs to finish), which are sent to a Linux server that distributes the jobs to 20 nodes. The system works fine and has returned correct data to my client PC. However, from time to time I get a long warning (message below), presumably just before the Linux server wants to send the results back to my machine. My computer does not finish the script (it stays in busy mode), probably still waiting for the results, although the cluster has finished all jobs. To me it looks like the cluster server can't communicate with my machine (in the job log file on the server, the job has status finished), but when I check the network communication manually everything seems fine. Any idea what could cause this, or any ideas for a possible work-around?

Many thanks in advance!!!


??? Error while evaluating TimerFcn for timer 'timer-1'
Unable to access files in directory /home/matlab/ on host 111.59.36.112 because of error
Failed to copy files from "/home/matlab//Job90.state.mat" on the host "111.59.36.112" to "E:\Matlab\cluster".
Command Output:
"Looking up host "111.59.36.112"
Connecting to 111.59.36.112 port 22
Server version: SSH-1.99-OpenSSH_3.9p1
We claim version: SSH-2.0-PuTTY_Release_0.60
Using SSH protocol version 2
Doing Diffie-Hellman group exchange
Doing Diffie-Hellman key exchange with hash SHA-1
Host key fingerprint is:
Initialised AES-256 SDCTR client->server encryption
Initialised HMAC-SHA1 client->server MAC algorithm
Initialised AES-256 SDCTR server->client encryption
Initialised HMAC-SHA1 server->client MAC algorithm
Reading private key file "D:\Dokumente und Einstellungen\private.ppk"
Pageant is running. Requesting keys.
Pageant has 1 SSH-2 keys
Pageant key #0 matches configured key file
Using username "abc".
Trying Pageant key #0
Authenticating with public key "rsa-key-20090703" from agent
Sending Pageant's response
Access granted
Opened channel for session
Started a shell/command
Using SCP1
Connected to 111.59.36.112

Sending file modes: C0644 8 Job90.state.mat
scp: E:\Matlab\cluster\Job90.state.mat: Cannot create file

Server sent command exit status 1
Disconnected: All channels closed
"

You will need to manually copy files from /home/matlab// on host 111.59.36.112
to the local directory E:\Matlab\cluster.
To stop seeing this message, cancel Job 90.

Subject: Error message distributed computing

From: shafriza

Date: 14 Jul, 2009 09:18:44

Message: 2 of 8

Hi Stef,

I'm not able to answer your question, but I'm very interested to know what
caused this since I've encountered the same problem. I send about 10
jobs, each using 4 nodes, and once in a while I get the same error. I need to
cancel the job and do it again. I suspect desktop MATLAB is waiting
for the job to finish in this command (waitForState(j);) but no signal
is being sent because the communication was terminated. If the jobs
on the cluster have finished, any idea how to retrieve the output
manually so that we don't lose the jobs?

Thanks,
shafriza
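
On the manual retrieval question, here is a minimal sketch of copying a job's files by hand, with a few retries in case the failure is only transient. It assumes a command-line scp client on the client machine (pscp from PuTTY would play the same role on Windows) and reuses the host, remote directory, and job number from the error message above. The function name fetch_job_files and the COPY_CMD override are made up for illustration and are not part of MATLAB or of anything posted in this thread.

```shell
# Hypothetical helper: copy all files of one job from the cluster,
# retrying a few times if the copy fails.
# COPY_CMD is an illustrative override so another client (e.g. pscp)
# can be swapped in; it defaults to scp.
COPY_CMD="${COPY_CMD:-scp}"

fetch_job_files() {
    local remote="$1"       # e.g. abc@111.59.36.112:/home/matlab
    local job="$2"          # e.g. Job90
    local dest="$3"         # local data directory
    local tries="${4:-3}"   # number of attempts
    local delay="${5:-10}"  # seconds to wait between attempts
    local i=1
    while [ "$i" -le "$tries" ]; do
        # The wildcard is quoted so that it is expanded on the remote side;
        # -r also picks up the job's sub-directory, if there is one.
        if "$COPY_CMD" -r "${remote}/${job}*" "$dest"; then
            return 0
        fi
        echo "copy attempt $i of $tries failed" >&2
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

A call might look like fetch_job_files abc@111.59.36.112:/home/matlab Job90 /path/to/local/data, with the local path adjusted to the data directory MATLAB expects. Whether MATLAB then picks up the copied files without further steps is not confirmed in this thread.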

Subject: Error message distributed computing

From: Steffen

Date: 15 Jul, 2009 08:22:01

Message: 3 of 8

Hiho,

well, here is my current work-around. I run my normal script, create the tasks and submit the job. Since completion takes about 20 hrs in my case, I stop the script right after the submit(job). I get a message from the Linux server as soon as the job is done, and then I run a second script with waitForState and getAllOutputArguments to retrieve the data. Never had a problem so far.
Disadvantage: it's an additional manual step, which in my case is even an advantage because I can use my MATLAB for other purposes in the meantime.
I've been playing around a bit with the copytocluster.m file and don't see a problem, so I suspect the problem arises either from temporary communication problems (since waitForState checks every 10 s or so, I'm bound to hit the error sooner or later) or from ssh.

Still hoping to get it running as it is supposed to be. Let me know if there are any ideas...

Steffen

Subject: Error message distributed computing

From: shafriza

Date: 15 Jul, 2009 23:18:00

Message: 4 of 8

Hi,

I'm also sending a similar question to MathWorks technical support and
would like to know what they advise. I'll share it here once I get a
reply from them.

Meanwhile, I'm wondering how you set up an email notification for when
the job is complete, since I'm running MATLAB DCS on the cluster using
a script instead of a PBS file. I'd appreciate it if you could advise
on how to set up the email notifications.

cheers,
shafriza

Subject: Error message distributed computing

From: Steffen

Date: 17 Jul, 2009 14:45:19

Message: 5 of 8

Simple bash code, which sends me an email as soon as a certain job is done on the cluster. It is still an additional manual step to retrieve the data.

Good luck and keep me posted in case you solve the problem...

Cheers,
Steffen
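
This is not the script from the post, only a rough illustration of the idea: a polling loop that could run on the cluster head node and send a mail once a job is done. It assumes the job's state file (e.g. /home/matlab/Job90.state.mat) contains the plain word finished once the job has completed, which is a guess based on the 8-byte file size in the scp log earlier in the thread, and that a mail command is installed. The function name wait_for_job, the polling interval, and the mail address are placeholders.

```shell
# Hypothetical sketch: poll a job state file until it reports "finished".
wait_for_job() {
    local state_file="$1"       # e.g. /home/matlab/Job90.state.mat
    local interval="${2:-60}"   # seconds between checks
    local max_checks="${3:-0}"  # 0 means keep checking forever
    local n=0
    while :; do
        # Assumption: the state file holds the job state as plain text.
        if [ -r "$state_file" ] && grep -q "finished" "$state_file"; then
            return 0
        fi
        n=$((n + 1))
        if [ "$max_checks" -gt 0 ] && [ "$n" -ge "$max_checks" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Example use (commented out so that sourcing this file has no side effects):
# wait_for_job /home/matlab/Job90.state.mat 60 &&
#     echo "Job 90 is finished" | mail -s "Job 90 done" you@example.com
```

Checking a file on the server rather than asking MATLAB keeps the client out of the loop entirely, which matches the idea of stopping the client script right after submit(job).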

Subject: Error message distributed computing

From: shafriza

Date: 20 Jul, 2009 03:11:59

Message: 6 of 8

Hi,

I got this message from one of the MathWorks staff regarding this issue.
However, I'm not quite sure how to apply what she suggests. The email is
enclosed.

--------------------------------------------------
Hi,

I am writing in reference to your Service Request # 1-A8L37Z regarding
'Problems while running serial jobs in Matlab DCS.'.

Do simple jobs (those involving small data files) execute completely without
errors? We typically encounter this issue if you use a third-party
scheduler and submit time-consuming jobs with large data.

Please see the following URL that explains the situation when this
error occurs:
http://www.mathworks.com/support/solutions/en/data/1-8NLCSR/?solution=1-8NLCSR

As mentioned in the above solution, please replace your
'copyJobFilesIfFinished.m' with the modified one (attached for your
convenience). Please let me know if this resolves the issue.

Please preserve the THREAD ID below in any further correspondence on
this query. This will allow our systems to automatically assign your
reply to the appropriate Service Request. If you have a new technical
support question, please submit a new request here:

http://www.mathworks.com.au/contact_TS.html


Sincerely,

Dana Meng
Technical Support Engineer
Technical Support Department
MathWorks
Level 5, Tower 1
495 Victoria Ave.
Chatswood NSW 2067
Australia
----------------------------------------------------------------

Please let me know if you can solve this issue based on this input.

regards,
shafriza




Subject: Error message distributed computing

From: Steffen

Date: 27 Jul, 2009 09:33:02

Message: 7 of 8

Hi Shafriza,

thanks a lot for the link! So it is indeed an ssh/size problem.
To be honest, I'm sticking with my bash script for now, since it works fine. Never change a running system. ;)
But as soon as the important data has been analyzed, I'll certainly give it a shot.

Thanks and cheers,
Steffen

Subject: Error message distributed computing

From: shafriza

Date: 29 Jul, 2009 02:10:22

Message: 8 of 8

Hi,

I tried it and it works fine (without any more 'timer-1' waiting
problems). I believe the m-files given at the URL have better error
handling.

Cheers,
shafriza

