How to use MDCS with a Torque based cluster
Greetings,
My background is in clusters; I know next to nothing about Matlab. However, I have many users who use Matlab on a daily basis. Thus I have been learning a lot over the last few weeks.
The cluster has a wide variety of users: C/MPI, C/OpenMP, Fortran/MPI, R, Python, etc., and Matlab.
Previously, the users have had to log into the cluster, start an interactive session, and then launch Matlab on the cluster node. This has caused lots of problems (for users and admins) and confusion all around. I spoke with someone who claimed the answer to this was the Parallel toolbox. Turns out that wasn't correct, but after talking with Matlab support I was given a trial version of MDCS.
My goal: allow the users to log into the login node, start up Matlab, and every time they try to run a job have Matlab contact the Torque/Moab scheduler such that the job launches on the cluster with the best possible resources available to it as quickly as possible (aka: use the already working Torque/Moab scheduler). I don't care if the job is a single thread with no toolboxes or if the job uses every single toolbox we have across every core on every node available. I want it to run on a node in the cluster through the Torque/Moab scheduler and not on the login node where it can interrupt other users.
My shoot-for-the-sky goal: allow the users (almost all of whom also have Matlab licenses on their low-end desktops) to submit jobs to the cluster without first having to log in to the cluster, transfer their code, and then run the job.
The MDCS documentation was not the most straightforward, especially for a Linux Torque/Moab cluster. However, the error messages and debugging were surprisingly helpful; the error messages pretty much covered all the gaps in the documentation. Yesterday, I finally got the Parallel->Manage Cluster Profiles->TorqueProfile to validate and run jobs on the cluster! Hooray!
But after that, I am really struggling here.
- I was given a "simple" parfor code to run by one of my users as well as instructions on how to run it. When I run it, it runs locally, not on the cluster. I know I am missing a step between the run button and the now default TorqueProfile but I haven't yet figured out what.
- I found some documentation on running Matlab code on a Linux cluster, but it was for Matlab's own job scheduler and most of it didn't seem to apply to Torque/Moab. It seems to me that I am just stumbling around in the dark with web searches and random blog postings as my guideposts.
- Several of us are now scratching our heads wondering how we actually utilize this MDCS whether it be to submit a job for a single thread or for multiple cores on multiple nodes. Picking the right amount of resources and time is quite important and varies between the user, the project, and the jobs being run. So far the best I have come up with is a different profile per job, but that is a rather insane method to inflict upon the users.
So the questions I have are:
1) Will MDCS actually do what I am wanting it to do? Will it actually meet my expectations/goals? Or is this not the right path for this product?
2) How can I allow/force users to run jobs on the cluster and not on the login node?
3) How do I make this ridiculously simple for the users so they can focus on their work and not on trying to configure their program for the cluster? I fear that right now there are too many steps, so they will either not use it or it will become a source of time-consuming problems for everyone. I suspect the real issue is simply that I don't yet understand how to make things work, so any information to improve this would be appreciated.
Thank you!
Accepted Answer
Edric Ellis
on 4 Jul 2013
It seems that you are making reasonable progress towards getting things working. We are always happy to hear about suggestions for making the installation documentation clearer, so please comment if you have anything specific that would have made things easier for you.
I think the remaining problem is getting the correct resource request for your jobs. In the cluster profile manager, when you're configuring the Torque profile, you'll see a field called "ResourceTemplate". In this, you need to place a template version of whatever resource specification you need for your cluster. In the template, there's a placeholder "^N^" that gets filled out when a job is submitted. For example, if you were to specify the ResourceTemplate as "-l nodes=^N^", then when a user requests "matlabpool open 10", that would be converted to "-l nodes=10". The precise details of the specification you need there vary by cluster setup, which is why we've made it configurable.
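To make the placeholder behaviour concrete, here is a minimal sketch of the substitution described above. It is an illustration only: the helper name expandResourceTemplate is hypothetical, and the plain string replacement stands in for whatever the toolbox does internally when it submits a job to Torque.

```matlab
% Illustration only: mimic how the "^N^" placeholder in a ResourceTemplate
% is filled in at submission time. expandResourceTemplate is a hypothetical
% helper written for this example, not a toolbox function.
expandResourceTemplate = @(template, n) strrep(template, '^N^', num2str(n));

resourceTemplate = '-l nodes=^N^';   % template from the example above
numWorkers = 10;                     % e.g. requested via "matlabpool open 10"

% Build the resource specification that would be passed on to the scheduler.
submitArgs = expandResourceTemplate(resourceTemplate, numWorkers);
disp(submitArgs)                     % prints: -l nodes=10
```

The same field should also be reachable programmatically as the ResourceTemplate property of the cluster object returned by parcluster('TorqueProfile'), assuming your profile is named TorqueProfile and your release still includes the built-in Torque integration.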