RUN MATLAB COMPUTATIONS ON A SUN GRIDENGINE CLUSTER
This demo shows how QSUB_SUBMIT_CM, QSUB_RUN_CM and QSUB_CHECK_FINISH can be used to run MATLAB computations on a cluster of UNIX/Linux machines. The actual user interface is QSUB_SUBMIT_CM. Both QSUB_RUN_CM and QSUB_CHECK_FINISH are usually only called from code that is generated in QSUB_SUBMIT_CM.
However, before describing QSUB_SUBMIT_CM, the underlying mechanisms of job execution are explained. Troubleshooting in a distributed environment can be tricky, so it is important to know where to start debugging when a certain kind of error occurs.
CREATE JOB STRUCTURE
Each computation job is stored in a job structure. Here, a very simple job is created to demonstrate how job submission, execution and results collection work. The task is to compute sin(rand(100)) and to return the result.
job.fun = @sin;
job.job = {rand(100)};
job.noutputs = 1;
job.ctx.path = path;
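Before a job is saved or submitted, it can help to check that the structure has the fields used in this demo. The following is only a sketch: the list of required fields is an assumption inferred from the example above, not something taken from QSUB_SUBMIT_CM.

```matlab
% Build the demo job again so this block is self-contained
job.fun = @sin;
job.job = {rand(100)};
job.noutputs = 1;
job.ctx.path = path;

% Assumption: these are the fields a job needs (inferred from this demo)
required = {'fun', 'job', 'noutputs', 'ctx'};
present = isfield(job, required);
missing = required(~present);

% Basic type checks on the fields that are present
jobok = isempty(missing) && ...
    isa(job.fun, 'function_handle') && ...
    iscell(job.job) && ...
    isnumeric(job.noutputs) && isscalar(job.noutputs) && job.noutputs >= 0;

if jobok
    disp('job structure looks complete');
else
    disp('job structure is incomplete or malformed; missing fields:');
    disp(missing);
end
```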
TESTING JOB EXECUTION
Before submitting jobs to a cluster, it is very important to test the correctness of job execution. There are different levels of tests:
- Execute job.fun on a sample data set
- Save job to a .mat file and run QSUB_RUN_CM from the current MATLAB session
- From a UNIX command line, use the saved job and run MATLAB with the run script that QSUB_SUBMIT_CM would generate
TESTING JOB.FUN
The actual execution takes place in QSUB_RUN_CM around line 39ff. This code is copied here to test execution of job.fun. Since job.noutputs equals 1 in our example, the else clause of the if statement will be executed.
if job.noutputs == 0,
    % evaluate job function - no output
    job.fun(job.job{:});
else
    % evaluate job function - capture output
    out.out = cell(1,job.noutputs);
    [out.out{:}] = deal(job.fun(job.job{:}));
end
The variable out should now contain a field out, which is a cell array with one member:
disp(out)
out: {[100x100 double]}
err: {}
It should contain the output argument of sin(rand(100)):
disp(isequalwithequalnans(out.out{1}, sin(job.job{1})))
1
TESTING QSUB_RUN_CM
To test QSUB_RUN_CM, the job has to be saved to a MAT-file. This file does not need to have a .mat extension. Note that the fields of the job variable are saved as individual variables in this file:
jobfilename = '/tmp/testjob.in';
save(jobfilename, '-struct','job');
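To confirm that the job fields really end up as individual variables, the file can be loaded back into a structure. This sketch writes to a temporary file so that it does not overwrite the demo job file; the name tmpjobfile is a stand-in for jobfilename.

```matlab
% Build the demo job again so this block is self-contained
job.fun = @sin;
job.job = {rand(100)};
job.noutputs = 1;
job.ctx.path = path;

% Temporary file name without a .mat extension, as in the demo
tmpjobfile = [tempname '.in'];
save(tmpjobfile, '-struct', 'job');

% '-mat' makes load treat the file as a MAT-file despite its extension
loaded = load(tmpjobfile, '-mat');
disp(fieldnames(loaded));

% The loaded structure should have the same fields and data as the job
samefields = isequal(sort(fieldnames(loaded)), sort(fieldnames(job)));
samedata = isequal(loaded.job, job.job);
delete(tmpjobfile);
```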
QSUB_RUN_CM expects three filenames as input - jobfilename, outfilename and flagfilename. The first one is the existing file containing the job description and inputs. The other files will be created by QSUB_RUN_CM.
outfilename = '/tmp/testjob.out';
flagfilename = '/tmp/testjob.flag';
Note: Running QSUB_RUN_CM will quit your MATLAB session after the job has finished! The command to run would be:
qsub_run_cm(jobfilename, outfilename, flagfilename);
After running QSUB_RUN_CM and restarting MATLAB, the results can be loaded. The new session starts with an empty workspace, so outfilename has to be defined again first:
out = load(outfilename, '-mat');
disp(out)
out: {[100x100 double]}
err: {}
If everything went well, out.out should contain a cell array of output arguments computed by job.fun. If an error occurred, out.err contains an MException object with an error description.
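A loaded result can be checked for failure before its outputs are used. The sketch below does not call QSUB_RUN_CM. It simulates a failed job with a try/catch block, on the assumption that out.err holds the caught MException as described above; the variable name res and the error identifier are made up for illustration.

```matlab
% Simulated result structure; field names follow the output shown above
res.out = {};
res.err = {};
try
    % Stand-in for a failing job function
    error('demo:jobFailed', 'simulated failure in job function');
catch ME
    res.err = ME;
end

% Use the outputs only if no exception was stored
if isa(res.err, 'MException')
    fprintf('job failed (%s): %s\n', res.err.identifier, res.err.message);
else
    result = res.out{1};
end
```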
TESTING MATLAB INVOCATION
The next step is to test the script which will be created by QSUB_SUBMIT_CM to run MATLAB. The MATLAB command line is constructed in QSUB_SUBMIT_CM at line 91f
runpath = fileparts(which('qsub_run_cm'));
mlcmd = sprintf(['%s -nodisplay -r ' ...
    '"addpath(''%s'');qsub_run_cm(''%%s'',''%%s'',''%%s'');"'], ...
    fullfile(matlabroot,'bin','matlab'), runpath);
This command contains placeholders for the three filename arguments of QSUB_RUN_CM. The actual shell script is created in QSUB_SUBMIT_CM around line 116ff
% Create executable shell script to run the job
scriptname = [jobfilename '.sh'];
fid = fopen(scriptname,'w');
fprintf(fid, '#!/bin/sh\n');
fprintf(fid, mlcmd, jobfilename, outfilename, flagfilename);
fclose(fid);
fileattrib(scriptname, '+x');
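The command line is formatted in two stages, which is easy to misread: the first sprintf fills in the MATLAB binary and the path and turns each %%s into %s, and the second formatting call fills in the three file names. The sketch below reproduces both stages with sprintf. The binary name 'matlab' and the path '/path/to/qsub' are placeholders for illustration, not values from QSUB_SUBMIT_CM.

```matlab
% Stage 1: fill in binary and path; each %%s survives as a literal %s
template = sprintf(['%s -nodisplay -r ' ...
    '"addpath(''%s'');qsub_run_cm(''%%s'',''%%s'',''%%s'');"'], ...
    'matlab', '/path/to/qsub');

% Stage 2: fill in the three file names
cmd = sprintf(template, '/tmp/testjob.in', '/tmp/testjob.out', ...
    '/tmp/testjob.flag');
disp(cmd)
```

Because the result of stage 1 is used as a format string again, a MATLAB root or path containing % or \ characters would be interpreted in stage 2. In the generated script file, a command line of this form follows the #!/bin/sh line.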
This script can then be executed from a Linux/UNIX command line. It should open a MATLAB session, run QSUB_RUN_CM and close MATLAB again. Note that the correct invocation and shell may differ from this example, depending on the shells available on the system.
[sts, termout] = unix(sprintf('. %s',scriptname))
sts =
0
termout =
/usr/local/matlab2009a/bin/matlab: 674: shopt: not found
< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
February 12, 2009
Warning: Duplicate directory name: /home/volkmar/matlab.
Warning: Duplicate directory name: /usr/local/matlab2009a/toolbox/local.
Warning: Duplicate directory name: /home/volkmar/matlab.
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
This script should be ready to be submitted to qsub as well:
[sts, qsubout] = unix(sprintf('qsub %s',scriptname))
sts =
0
qsubout =
Your job 637 ("testjob.in.sh") has been submitted
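After a plain qsub submission, nothing reports back to the MATLAB session. One way to wait for the result is to poll for the flag file. This is only a sketch: it assumes that the flag file appears once the job has finished, which this demo does not spell out, and it creates the file itself to simulate a finished job. The names flagfile, timeout and interval, and their values, are arbitrary choices.

```matlab
flagfile = [tempname '.flag'];   % stand-in for flagfilename
fclose(fopen(flagfile, 'w'));    % simulate a finished job

timeout = 10;     % seconds to wait before giving up
interval = 0.5;   % seconds between checks
finished = false;
t0 = tic;
while toc(t0) < timeout
    % exist returns 2 if the name refers to an existing file
    if exist(flagfile, 'file') == 2
        finished = true;
        break
    end
    pause(interval);
end

if finished
    disp('flag file found, results can be loaded');
else
    disp('timed out waiting for flag file');
end
delete(flagfile);
```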
JOB SUBMISSION THROUGH QSUB_SUBMIT_CM
QSUB_SUBMIT_CM is the interface through which jobs are actually submitted. It saves the input data to a .mat file, creates the script command and invokes qsub to submit the job. In addition, it can start a MATLAB timer object to supervise the job and retrieve computation results. A user-specified callback has to be provided to use this feature. The simplest callback would be to display the computed results.
jobname = 'testjob';
jobdir = '/tmp'; % This must be a rw folder on a shared network drive
finishcb = @disp;
qsub_submit_cm(job, jobdir, jobname, finishcb)
EVALUATION OF RESULTS IN FINISHCB
The job timer will be monitored by QSUB_CHECK_FINISH. When the job has finished, this function will try to load the job output file and pass its contents on to the specified callback. If the computation was successful, finishcb will be called with a cell array containing the output(s) of the computation. If the computation failed, finishcb will be called with an MException object describing the reason for failure. A simple callback could rethrow an exception or display the output:
function finishcb(out)
if isa(out, 'MException')
rethrow(out)
else
disp(out)
end
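While trying things out, it can be more convenient to report a failure than to rethrow it. The following sketch applies the same type check to both kinds of input and builds a message for each. The inputs are made up for illustration; in real use they would be passed in by QSUB_CHECK_FINISH.

```matlab
% One successful result (cell array of outputs) and one simulated failure
inputs = {{sin(0)}, MException('demo:jobFailed', 'simulated job failure')};
msgs = cell(size(inputs));
for k = 1:numel(inputs)
    out = inputs{k};
    if isa(out, 'MException')
        % Failure: report the message instead of rethrowing
        msgs{k} = sprintf('job failed: %s', out.message);
    else
        % Success: out is a cell array with one entry per output argument
        msgs{k} = sprintf('job returned %d output(s)', numel(out));
    end
    disp(msgs{k});
end
```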