How to use MDCS with a Torque based cluster

Greetings,
My background is in clusters; I know next to nothing about Matlab. However, I have many users who use Matlab on a daily basis. Thus I have been learning a lot over the last few weeks.
The cluster has a wide variety of users: C/MPI, C/OpenMP, Fortran/MPI, R, Python, etc., and Matlab.
Previously, the users have had to log into the cluster, start an interactive session, and then launch Matlab on the cluster node. This has caused lots of problems (for users and admins) and confusion all around. I spoke with someone who claimed the answer to this was the Parallel toolbox. Turns out that wasn't correct, but after talking with Matlab support I was given a trial version of MDCS.
My goal: allow the users to log into the login node, start up Matlab, and every time they try to run a job have Matlab contact the Torque/Moab scheduler so that the job launches on the cluster with the best possible resources available to it as quickly as possible (i.e., use the already working Torque/Moab scheduler). I don't care if the job is a single thread with no toolboxes or if the job uses every single toolbox we have across every core on every node available. I want it to run on a node in the cluster through the Torque/Moab scheduler and not on the login node where it can interrupt other users.
My shoot-for-the-sky goal: allow the users (most of whom also have Matlab licenses on their low-end desktops) to submit jobs to the cluster without first having to log in to the cluster, transfer their code, and then run the job.
The MDCS documentation was not the most straightforward, especially for a Linux Torque/Moab cluster. However, the error messages and debugging output were surprisingly helpful. The error messages pretty much covered all the gaps in the documentation. Yesterday, I finally got the Parallel->Manage Cluster Profiles->TorqueProfile to validate and run jobs on the cluster! Hooray!
But after that, I am really struggling here.
  • I was given a "simple" parfor code to run by one of my users as well as instructions on how to run it. When I run it, it runs locally, not on the cluster. I know I am missing a step between the run button and the now default TorqueProfile but I haven't yet figured out what.
  • I found some documentation on running Matlab code on a Linux cluster, but it was for the MATLAB Job Scheduler and most of it didn't seem to apply to Torque/Moab. It seems to me that I am just stumbling around in the dark with web searches and random blog postings as my guideposts.
  • Several of us are now scratching our heads wondering how we actually utilize this MDCS whether it be to submit a job for a single thread or for multiple cores on multiple nodes. Picking the right amount of resources and time is quite important and varies between the user, the project, and the jobs being run. So far the best I have come up with is a different profile per job, but that is a rather insane method to inflict upon the users.
So the questions I have are:
1) Will MDCS actually do what I am wanting it to do? Will it actually meet my expectations/goals? Or is this not the right path for this product?
2) How can I allow/force users to run jobs on the cluster and not on the login node?
3) How do I make this ridiculously simple for the users so they can focus on their work and not on trying to configure their program for the cluster? I fear that right now there are too many steps and things to do, so they will not use it or it will just be a source of time-consuming problems for everyone. I have a feeling that I simply don't yet understand how to make things function, so any information to improve this would be appreciated.
Thank you!

Accepted Answer

Edric Ellis on 4 Jul 2013
It seems that you are making reasonable progress towards getting things working. We are always happy to hear about suggestions for making the installation documentation clearer, so please comment if you have anything specific that would have made things easier for you.
I think the remaining problem is getting the correct resource request for your jobs. In the cluster profile manager, when you're configuring the Torque profile, you'll see a field called "ResourceTemplate". In this, you need to place a template version of whatever resource specification you need for your cluster. In the template, there's a placeholder "^N^" that gets filled out when a job is submitted. For example, if you were to specify the ResourceTemplate as "-l nodes=^N^", then when a user requests "matlabpool open 10", that would be converted to "-l nodes=10". The precise details of the specification you need there vary by cluster setup, which is why we've made it configurable.
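The placeholder substitution described above can be sketched in a few lines. This is only an illustrative model of the behavior, not MathWorks' implementation; the function name `expand_resource_template` is hypothetical, and it assumes the placeholder is replaced by exactly the number of workers requested.

```python
# Illustrative model of how a ResourceTemplate is filled in at submit time.
# Hypothetical helper; the real substitution happens inside MATLAB.

PLACEHOLDER = "^N^"


def expand_resource_template(template: str, num_workers: int) -> str:
    """Return the template with every ^N^ replaced by the worker count."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return template.replace(PLACEHOLDER, str(num_workers))


if __name__ == "__main__":
    # "matlabpool open 10" with ResourceTemplate "-l nodes=^N^"
    print(expand_resource_template("-l nodes=^N^", 10))  # prints: -l nodes=10
```

The point of the template is that the resource request scales with the pool size the user asks for, so one profile can serve jobs of different sizes.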
  1 Comment
Chris on 10 Jul 2013
Greetings,
I figured it out this morning. The template should be "-l procs=^N^".
Thanks for pointing me in the right direction Edric.


More Answers (5)

Chris on 3 Jul 2013
Naturally, minutes after I post I get an email from a user with more information....
I have the demo trial that is supposedly good for 256 workers. My cluster easily has that many cores. However, we get an error when we run:
matlabpool open 256
Error using matlabpool (line 144)
Failed to open matlabpool. (For information in addition to the causing error, validate the profile 'TorqueProfile1' in the Cluster Profile Manager.)
Caused by:
Error using parallel.internal.pool.InteractiveClient/start (line 281)
Failed to start matlabpool.
Error using parallel.Job/submit (line 304)
Error executing the PBS script command 'qsub'. The reason given is
qsub: submit error (Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy))
But if we request 16 (the total number of cores on a single node), it succeeds and we can run via the command line.
We haven't figured out the run button yet and the limit of a single node is still an issue but this is good progress in and of itself! :-)

Chris on 3 Jul 2013
Another addendum. It seems the parfor loop he wrote works with the matlabpool, but a part of the problem is that the second "test" we were doing was bench(10) and that always runs locally and not on the matlabpool.
I don't know enough about matlab yet to say if that is supposed to work that way or not. But we may need another test to verify jobs are running properly on the nodes.
Thanks!

Chris on 3 Jul 2013
Edited: Chris on 3 Jul 2013
More information. After some more research I found someone who was able to span multiple computers by adding a list parameter. Thus, I added "-l nodes=10:ppn=16" to the additional Torque properties. No matter what pool size I request, I always get a job of that fixed size, so it isn't a final solution by any means. Still, I tried running a test job.
The result:
Trial>> matlabpool open 10
Starting matlabpool using the 'TorqueProfile1' profile ... stopped.
Error using matlabpool (line 144)
Failed to open matlabpool. (For information in addition to the causing error, validate the profile 'TorqueProfile1' in the Cluster Profile Manager.)
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 783)
The interactive communicating job failed with no message.
It says no message, and it doesn't give me much more than that. So I went to the Torque/Moab logs. As far as the scheduler is concerned, the job was started successfully and then Matlab killed it. There is no error I can see.
I am stuck again. Any ideas?
Thanks.
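A minimal sketch of the arithmetic behind a hard-coded request such as "-l nodes=10:ppn=16": it asks for 10 nodes with 16 processors each, regardless of the pool size. The helper name `cores_requested` is hypothetical, and the sketch assumes the common Torque convention that a missing ppn counts as 1.

```python
import re


def cores_requested(resource_spec: str) -> int:
    """Total cores implied by a Torque 'nodes=X[:ppn=Y]' resource request."""
    match = re.search(r"nodes=(\d+)(?::ppn=(\d+))?", resource_spec)
    if match is None:
        raise ValueError("no nodes= specification found in: " + resource_spec)
    nodes = int(match.group(1))
    # Assumption: ppn defaults to 1 when it is not given.
    ppn = int(match.group(2)) if match.group(2) else 1
    return nodes * ppn


if __name__ == "__main__":
    fixed_spec = "-l nodes=10:ppn=16"
    # The request does not depend on the pool size, so every pool size
    # maps to the same 10 * 16 core allocation.
    for pool_size in (10, 16, 256):
        print(pool_size, cores_requested(fixed_spec))
```

Because the specification contains no placeholder, the scheduler sees the same request for every job, which matches the fixed job size observed above.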

Chris on 5 Jul 2013
Ah, OK. I didn't connect that the ^N^ was the variable. I thought I was supposed to replace that with my own numbers.
I am going to do some more testing and I will post back on the progress.
Thank you.

Chris on 10 Jul 2013
Greetings,
I made the change but I am still having issues with Matlab not getting the right number of nodes. I have 20 nodes. Each node has 16 cores. The demo license should be for 256.
If I tell matlab I have 20 workers, it fires up jobs on all 16 cores of one node + 4 more on a second node.
If I tell matlab I have 21 workers, the job dies saying it doesn't have the resources.
If I explicitly state the maximum number of cores I have (16 processors on 16 nodes for 256), it will validate properly with all 256, but I end up in the same situation as before.
Any ideas why Matlab isn't launching the workers properly?
Thanks!
