Thread Subject: errors during parallel computing

Subject: errors during parallel computing

From: Juliette Salexa

Date: 10 Aug, 2009 03:41:01

Message: 1 of 4

I submitted a parallel job to 4 labs and for a long time didn't get any answer; the command window just said "busy" at the bottom.

When I noticed that MATLAB's CPU usage had gone from 100% to 0%, I realized the computations must have stopped, so I finally decided to cancel with CTRL+C.

Only AFTER it was cancelled, and after the output of my TOC command was printed, did I get:

Elapsed time is 3862.371101 seconds.
??? A read error occurred while reading from lab 2. This is causing:
java.net.SocketException: Connection reset

The error message (lines 2 and 3) appeared a little after the 'Elapsed time' line.


Of course, problem number 1 is that this error message is not very descriptive.
But even more importantly, why does it wait until AFTER I cancel to tell me that there was an error reading from lab 2?

I notice this even when displaying the results of expressions without semicolons. For example, if I have a tic and toc within a function, MATLAB waits until AFTER the entire process is over to show me the output of TOC.

Is there no way to FORCE MATLAB to report an error (or the results of expressions without semicolons) in real time, rather than waiting for the whole process to finish?

This was especially annoying since the command window just said "busy" even after it had hit the fatal error, when it really wasn't busy; it wasn't doing anything.

Subject: errors during parallel computing

From: us

Date: 10 Aug, 2009 09:47:02

Message: 2 of 4

"Juliette Salexa" <juliette.physicist@gmail.com> wrote in message <h5o4sd$m7f$1@fred.mathworks.com>...
> I submitted a parallel job to 4 labs
> This was especially annoying since the command window just said "busy"...

thanks for your interesting, in-depth description of your ML experience...
needless to say that this NG cannot be of any help in this holy matter...

try

http://www.mathworks.com/support/contact_us/index.html

us

Subject: errors during parallel computing

From: Edric M Ellis

Date: 11 Aug, 2009 12:40:08

Message: 3 of 4

"Juliette Salexa" <juliette.physicist@gmail.com> writes:

> I submitted a parallel job to 4 labs, and for a long time didn't get any
> answer.. the command window just said "busy" at the bottom.

Did you submit a non-interactive job, or did you open MATLABPOOL? Were you using
remote machines, or only local workers?

> When I noticed that matlab's cpu usage went from 100% to 0%, I realized the
> computations must have stopped, so I finally decided to cancel it with CTRL+C,
>
> and then AFTER it got cancelled and the answer to my command TOC was executed,
> I got:
>
> Elapsed time is 3862.371101 seconds.
> ??? A read error occurred while reading from lab 2. This is causing:
> java.net.SocketException: Connection reset

Were you using SPMD or PARFOR when doing this?

I appreciate that the error message is a little cryptic; it basically
means that the connection from the desktop MATLAB to the workers was
unexpectedly severed. Under certain circumstances, this error gets reported
asynchronously - where possible, we do try to report the error in such a way
that the execution gets aborted in the expected way. Without knowing more about
your computation, it's hard to tell what went wrong.

Cheers,

Edric.

Subject: errors during parallel computing

From: Juliette Salexa

Date: 11 Aug, 2009 19:46:04

Message: 4 of 4

Thank you us and Edric,

I haven't had much luck with tech support since I have a student version.
------------------------------------------------
> Did you submit a non-interactive job, or did you open MATLABPOOL? Were you using
> remote machines, or only local workers?
------------------------------------------------
I don't know what a 'non-interactive job' is... but I did open a MATLABPOOL, and opened 4 local workers.
------------------------------------------------
> Were you using SPMD or PARFOR when doing this?
------------------------------------------------
I was using PARFOR
------------------------------------------------
> I appreciate that the error message is a little cryptic, this basically simply
> means that the connection from the desktop MATLAB to the workers was
> unexpectedly severed. Under certain circumstances, this error gets reported
> asynchronously - where possible, we do try to report the error in such a way
> that the execution gets aborted in the expected way. Without knowing more about
> your computation, it's hard to tell what went wrong.
------------------------------------------------

This same problem has occurred dozens of times during the last week.
The program appears to keep running indefinitely, but after some amount of time I notice in my task manager that matlab.exe has gone from using 100% CPU to 0% CPU.

So since matlab is no longer doing anything, I go to the command window and press CTRL+C.

The computation stops, and then about 10 seconds later I get the message:

??? A read error occurred while reading from lab 2. This is causing:
java.net.SocketException: Connection reset


This happens so consistently that I can actually predict what the error message will be when I notice that the cpu usage has gone to 0% and yet the program 'appears to be' still running.

What's strange is that when I rerun the EXACT same program with no modifications, it sometimes runs to completion and sometimes does not, in which case it gives the above error message 10 seconds after I press CTRL+C. (Sometimes the error is from reading lab 3 or lab 4.)

So the problem is not deterministic, which makes me believe either ML is doing something non-deterministically, or something on my computer is interfering with it [ although this has also occurred at night with nothing else running ].

The way I got around this was writing a script that says "if CPU usage goes to 0% AND matlabpool is still open, exit matlab and rerun the exact same code"

And after about 5 iterations (sometimes less), the code will run to completion.

But it would still be nice to understand WHY this connection between the desktop ML and its labs gets severed, and how to avoid it in the future. I'm quite sure it's not depletion of RAM.
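
For anyone who wants to try the same kind of workaround, here is a minimal sketch of a watchdog in Python. It is not the script described above (which isn't shown), only an illustration of the idea: start the job, poll it, and if it looks stalled for several polls in a row, kill it and start it again. The function name `run_with_watchdog`, its parameters, and the `is_idle` callback are all hypothetical; how you measure "CPU usage has gone to 0%" is platform-specific and is deliberately left to the caller.

```python
import subprocess
import time


def run_with_watchdog(cmd, is_idle, poll_seconds=5.0, idle_polls=3, max_attempts=5):
    """Run cmd, restarting it whenever it appears to have stalled.

    cmd       -- argument list passed to subprocess.Popen
    is_idle   -- caller-supplied function taking the Popen object and
                 returning True when the process looks stalled
                 (hypothetical hook; e.g. CPU usage near 0%)
    Returns the attempt number that exited with status 0, or None if
    every attempt stalled or failed.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(cmd)
        idle_count = 0
        while proc.poll() is None:
            time.sleep(poll_seconds)
            if proc.poll() is not None:
                break  # finished on its own while we were sleeping
            # Count consecutive idle polls; any activity resets the count.
            idle_count = idle_count + 1 if is_idle(proc) else 0
            if idle_count >= idle_polls:
                proc.kill()  # looks stalled: give up on this attempt
                break
        proc.wait()
        if proc.returncode == 0:
            return attempt
    return None
```

Requiring several consecutive idle polls, rather than one, is meant to avoid killing a run that is only briefly quiet. A restart like this only hides the symptom, of course; it says nothing about why the connection to the labs is dropped.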
