About spmd and GPU calculation on a cluster

Ki on 26 Sep 2012
Hi there, I am using MATLAB R2012b on a multi-node, multi-processor cluster with GPU-capable video cards installed. The cluster provides a task manager (qsub) to run code across multiple nodes and/or processors. I set up MATLAB code to run on 1 node and 8 processors with spmd, and it runs without any problem. But since I have to run a batch of tasks with different parameters, I set up a script to qsub all my programs (100 MATLAB programs in total, each running spmd on 8 processors). The script submits the tasks very quickly, but I found that in this case MATLAB reports that the work pool failed to open. If I submit the programs one by one manually, it does not report the same error. Is anything wrong with this approach?
By the way, I am wondering whether it works to call the GPU with gpuArray inside an spmd block. Is there any way for MATLAB to avoid conflicts and/or I/O errors in parallel execution? Thanks.
  2 Comments
Jason Ross on 27 Sep 2012
Edited: Jason Ross on 27 Sep 2012
It would probably help if you could clarify somewhat. Given that you are using qsub, I'm guessing you are using PBS Pro or Torque. Are you submitting your work using the direct integration (on the Parallel menu), or via some other means? Are you using the generic interface?
When you say you wrote a script, are you talking about a MATLAB script, shell script, or something else? Also, when you say "batch" are you talking about the MATLAB batch command, or do you mean something else? You might want to look at batch to see if it might do what you need.
When you say "the work pool failed to open", do you mean that MATLAB says "the matlabpool failed to open", or is this an error from somewhere else?
When you do "run by run manually", how are you doing that? Through MATLAB? At the command line on your system?
For your spmd questions, you might be able to use "labindex" to do what you want. It would allow you to select the GPU you want the code to run on.
It might also help if you posted some example code snippets to show what you are doing.
Ki on 27 Sep 2012
Hi Jason, I don't know much about the cluster structure, since it is not maintained by us. It is run by another party and open to registered users. But I think it is using PBS.
We can use qsub to manually submit a MATLAB script to the system, which schedules it to run when resources are available. But it takes too much effort to submit every script manually when we have so many programs to run. So I wrote a bash script that searches for all MATLAB programs in the current folder and its subfolders and submits them with qsub one by one. Something like this:
find . -name '*.m' | while read -r FILENAME; do qsub "$FILENAME"; done
Here 'qsub xxx' is an abbreviation for submitting the corresponding MATLAB program.
Note that the cluster has a multi-node, multi-processor architecture, so once a program is submitted, the scheduler assigns one node and 8 processors to MATLAB to run it. Say I have 10 programs to run. If I submit all 10 automatically with the script, which is very fast, some programs do not run correctly and MATLAB says 'the matlabpool failed to open'. But if I submit the same programs one by one manually, no such problem occurs. I am wondering whether there is a conflict if we ask MATLAB to open pools at 'almost' the same time, even when they are running on different nodes?
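One way to test the timing hypothesis above is to space out the submissions so that no two pools start at nearly the same moment. The following is only a minimal sketch under assumptions: `submit_staggered` and the `SUBMIT_CMD` variable are hypothetical names introduced here, and plain `qsub` stands in for whatever the site's real submission command is.

```shell
#!/usr/bin/env bash
# Sketch: submit every .m file under a directory, pausing between submissions.
# SUBMIT_CMD is a hypothetical override; it defaults to plain `qsub`,
# which stands in for the site's real submission command.
submit_staggered() {
    local root=$1 delay_seconds=$2
    local submit_cmd=${SUBMIT_CMD:-qsub}
    # -print0 with read -d '' keeps file names containing spaces intact;
    # sort -z makes the submission order predictable.
    find "$root" -name '*.m' -print0 | sort -z |
    while IFS= read -r -d '' file; do
        # Redirect stdin so the submit command cannot swallow the file list.
        "$submit_cmd" "$file" < /dev/null
        sleep "$delay_seconds"
    done
}
```

For example, `submit_staggered . 30` would leave 30 seconds between submissions. If the failures disappear with a delay and return without one, that points at simultaneous pool startup rather than at the programs themselves.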
By "run manually", I mean I type the qsub command at the command line to submit each program, instead of using the script to find the programs and run qsub automatically.
By the way, even when I submit the programs manually, say the first 5 programs have already been submitted and are running in the background (each program uses 8 processors, so 5*8=40 workers are running in total). When I submit the 6th program and it runs, the MATLAB console might report that 40 workers are already open and that you can close them by forcing the pool to close. Why is that? Does it mean I cannot run more workers (note that each group of 8 workers is on a different node)?
For the spmd question, I am not trying to select which GPU to run on. The system has 8 processors and 1 GPU. If I have the following code to run
matlabpool open 8
spmd
    X = gpuArray(rand(100));
    Y = X.*X;
end
matlabpool close
So the spmd block actually runs on 8 processors, but each one will call the GPU to compute X.*X, right? Will those 8 processors conflict when there is only one GPU?


Answers (1)

Jason Ross on 28 Sep 2012
Some of the things make sense now.
For the "matlabpool failed to open" problems, I'd suspect that the scheduler has more compute resources than you have MATLAB licenses. When you open a pool, you consume X licenses, so when you put 10 jobs on the scheduler, it tries to check out 10X licenses when you only have 5X, even though the scheduler itself might have the resources to run all 10 jobs. If you wait for the first 5 jobs to complete, can you run the next batch of 5 without incident? I know some schedulers can be configured to check a license count and hold off jobs if the required number of licenses is not available, so you might want to ask about that.
I believe this solution addresses your GPU question. It sounds like it should be possible, although the access will be serialized:
You might also want to look into using the direct integration with PBS/Torque if possible, as it sounds like you might be able to avoid the step of writing the batch file to submit the jobs to the scheduler.
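To try the license hypothesis from this answer without changing the scheduler configuration, the submission script could hold back new jobs until fewer than a chosen number are active. This is only a rough sketch: `count_active_jobs`, `wait_for_slot`, `submit_throttled`, and `SUBMIT_CMD` are hypothetical names, and it assumes that `qstat -u` prints one row per job beginning with the numeric job ID, which should be checked against the local PBS output.

```shell
#!/usr/bin/env bash
# Sketch: keep at most a fixed number of this user's jobs active at once.

# Count this user's queued or running jobs.
# ASSUMPTION: each job row printed by `qstat -u` starts with a numeric job ID;
# header lines do not. Adjust the pattern to match the local output.
count_active_jobs() {
    qstat -u "$USER" 2>/dev/null | grep -c '^[0-9]'
}

# Block until fewer than max_active jobs are active, polling periodically.
wait_for_slot() {
    local max_active=$1 poll_seconds=$2
    while [ "$(count_active_jobs)" -ge "$max_active" ]; do
        sleep "$poll_seconds"
    done
}

# Submit each listed file, waiting for a free slot before every submission.
# SUBMIT_CMD is a hypothetical override; plain `qsub` stands in for the
# site's real submission command.
submit_throttled() {
    local max_active=$1 poll_seconds=$2
    shift 2
    local file
    for file in "$@"; do
        wait_for_slot "$max_active" "$poll_seconds"
        "${SUBMIT_CMD:-qsub}" "$file" < /dev/null
    done
}
```

A call such as `submit_throttled 5 60 job01.m job02.m` (file names are placeholders) would keep at most 5 jobs active and poll once a minute. The limit of 5 only mirrors the point at which failures were reported in this thread, not a known license count. If everything runs cleanly under the limit, that would support the license explanation.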
  2 Comments
Ki on 29 Sep 2012
Hi Jason, I don't think it is a license problem, since the cluster is owned by an academic unit and comes with a license for running multiple MATLAB instances. What is strange is that if we submit the programs with the script, say 10 programs, the failure appears once the first 5 or 6 programs have run. But if we submit them manually, no failure appears.
Jason Ross on 2 Oct 2012
That is very puzzling. Could it be that the pool is already open on some of the workers and then the follow-up jobs get placed on the same one, in effect trying to open the same pool a second time?

