Bioinformatics Toolbox 3.4
Batch Processing of Spectra Using Sequential and Parallel Computing
This demonstration illustrates how you can use a single computer, a multicore computer, or a cluster of computers to preprocess a large set of mass spectrometry signals. Note: Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™ are required for the last part of this demo.
Contents
Introduction
This demonstration shows the required steps to set up a batch operation over a group of mass spectra contained in one or more directories. You can achieve this sequentially, or in parallel using either a multicore computer or a cluster of computers. Batch processing adapts to the single-program multiple-data (SPMD) parallel computing model, and it is best suited for Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™.
The signals to preprocess come from protein surface-enhanced laser desorption/ionization-time of flight (SELDI-TOF) mass spectra. The data in this example are from the FDA-NCI Clinical Proteomics Program Databank (http://home.ccr.cancer.gov/ncifdaproteomics/ ). In particular, the demo uses the high-resolution ovarian cancer data set that was generated using the WCX2 protein array.
Setting the Repository for the Data
This demonstration assumes that you have downloaded and uncompressed the data sets into your repository. Ideally, you should place the data sets in a network drive. If the workers all have access to the same drives on the network, they can access needed data that reside on these shared resources. This is the preferred method for sharing data, as it minimizes network traffic.
First, get the name and full path to the data repository. Two strings are defined: the path from the local computer to the repository and the path required by the cluster computers to access the same directory. Change both accordingly to your network configuration.
local_repository = '//home/user/MassSpecRepository/OvarianCD_PostQAQC/'; worker_repository = '//public/user/MassSpecRepository/OvarianCD_PostQAQC/';
For this particular example, the files are stored in two subdirectories: 'Normal' and 'Cancer'. You can create lists containing the files to process using the DIR command,
cancerFiles = dir([local_repository 'Cancer/*.txt']) normalFiles = dir([local_repository 'Normal/*.txt'])
cancerFiles =
121x1 struct array with fields:
name
date
bytes
isdir
datenum
normalFiles =
95x1 struct array with fields:
name
date
bytes
isdir
datenum
and put them into a single variable:
files = [ strcat('Cancer/',{cancerFiles.name}) ... strcat('Normal/',{normalFiles.name})]; N = numel(files) % total number of files
N = 216
Sequential Batch Processing
Before attempting to process all the files in parallel, you need to test your algorithms locally with a for loop.
Write a function with the sequential set of instructions that need to be applied to every data set. The input arguments are the path to the data (depending on how the machine that will actually do the work sees them) and the list of files to process. The output arguments are the preprocessed signals and the M/Z vector. Because the M/Z vector is the same for every spectrogram after the preprocessing, you need to store it only once. For example:
type msbatchprocessing
function [MZ,Y] = msbatchprocessing(repository,files)
% MSBATCHPROCESSING Demonstration function for BIODISTCOMPDEMO
%
% [MZ,Y] = MSBATCHPROCESSING(REPOSITORY,FILES) Preprocesses the
% spectrogram in files FILES and returns the mass/charge (MZ) and ion
% intensities (Y) vectors.
%
% Hard-coded parameters in the preprocessing steps have been adjusted to
% deal with the high-resolution spectrograms of the demonstration.
% Copyright 2004-2008 The MathWorks, Inc.
K = numel(files);
Y = zeros(15000,K); % need to preset the size of Y for memory performance
MZ = zeros(15000,1);
parfor (k = 1:K)
file = [repository files{k}];
% read the two-column text file with mass-charge and intensity values
D = textread(file);
% resample the signal to 15000 points between 2000 and 11900
[mz,YR] = msresample(D(:,1),D(:,2),15000,'RANGE',[2000,11900]);
% align the spectrograms to two good reference peaks
P = [3883.766 7766.166];
YA = msalign(mz,YR,P,'WIDTH',2);
% estimate and adjust the background
YB = msbackadj(mz,YA,'STEP',50,'WINDOW',50);
% reduce the noise using a nonparametric filter
Y(:,k) = mslowess(mz,YB,'SPAN',5);
% the mass/charge vector is the same for all spectra after the resample
if k==1
MZ(:,k) = mz;
end
end
Note inside the function MSBATCHPROCESSING the intentional use of PARFOR instead of FOR. Batch processing is generally implemented by tasks that are independent between iterations. In such case, the statement FOR can indifferently be changed to PARFOR, creating a sequence of MATLAB® statements (or program) that can run seamlessly on a sequential computer, a multicore computer, or a cluster of computers, without having to modify it. In this case, the loop executes sequentially because you have not set any parallel configuration. For the demo purposes, only 20 spectrograms are preprocessed and stored in the Y matrix. You can measure the amount of time MATLAB® takes to complete the loop using the TIC and TOC commands.
tic repository = local_repository; K = 20; % change to N to do all [MZ,Y] = msbatchprocessing(repository,files(1:K)); disp(sprintf('Sequential time for %d spectrograms: %f seconds',K,toc))
Sequential time for 20 spectrograms: 64.480435 seconds
Parallel Batch Processing with Multicore Computers
If you have Parallel Computing Toolbox™, you can use up to four local workers to parallelize the loop iterations. For example, if your local machine is dual-core, you can start a worker pool with two labs using the default 'local' parallel configuration:
matlabpool open local 2 tic repository = local_repository; K = 20; % change to N to do all [MZ,Y] = msbatchprocessing(repository,files(1:K)); disp(sprintf('Parallel time with two local labs for %d spectrograms: %f seco nds',K,toc))
Starting matlabpool using the parallel configuration 'local'. Waiting for parallel job to start... Connected to a matlabpool session with 2 labs. Parallel time with two local labs for 20 spectrograms: 35.670026 seconds
Stop the local worker pool:
matlabpool close
Sending a stop signal to all the labs... Waiting for parallel job to finish... Performing parallel job cleanup... Done.
Parallel Batch Processing with Distributed Computing
If you have Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™ you can also distribute the loop iterations to a larger number of computers. In this example, the parallel configuration 'compbio_config_01' links to a cluster of four dual-core machines (eight workers). For information about setting up and selecting parallel configurations, see "Programming with User Configurations" in the Parallel Computing Toolbox™ documentation.
Note that if you have written your own batch processing function, you should include it in the respective configuration by using the Configurations Manager. This will ensure that MATLAB® properly transmits your new function to the workers. You access the Configurations Manager using the Parallel pull-down menu on the MATLAB® desktop.
matlabpool open compbio_config_01 8 tic repository = worker_repository; K = 20; % change to N to do all [MZ,Y] = msbatchprocessing(repository,files(1:K)); disp(sprintf('Parallel time with 8 remote labs for %d spectrograms: %f secon ds',K,toc))
Starting matlabpool using the parallel configuration 'compbio_config_01'. Waiting for parallel job to start... Connected to a matlabpool session with 8 labs. Parallel time with 8 remote labs for 20 spectrograms: 8.552093 seconds
Stop the remote worker pool:
matlabpool close
Sending a stop signal to all the labs... Waiting for parallel job to finish... Performing parallel job cleanup... Done.
Asynchronous Parallel Batch Processing
The execution schemes described above all operate synchronously, that is, they block the MATLAB® command line until their execution is completed. If you want to start a batch process job and get access to the command line while the computations run asynchronously (async), you can manually distribute the parallel tasks and collect the results later. This example uses the same four dual-core machine cluster as before.
Create one job with one task (MSBATCHPROCESSING). The task runs on one of the workers, and its internal PARFOR loop is distributed among all the available workers in the parallel configuration. Note that if N (number of spectrograms) is much larger than the number of available workers in your parallel configuration, Parallel Computing Toolbox™ automatically balances the work load, even if you have a heterogeneous cluster.
tic % start the clock repository = worker_repository; K = N; % do all spectrograms JOB = createMatlabPoolJob('configuration','compbio_config_01',... 'MinimumNumberOfWorkers',8,'MaximumNumberOfWorkers',8); TASK = createTask(JOB,@msbatchprocessing,2,{repository,files(1:K)}); submit(JOB)
When the job is submitted, your local MATLAB® prompt returns immediately. Your parallel job starts once the parallel resources become available. Meanwhile, you can monitor your parallel job by inspecting the TASK or JOB objects. Use the WAITFORSTATE method to programmatically wait until your task finishes:
waitForState(TASK, 'finished')
TASK.OutputArguments
ans =
[15000x1 double] [15000x216 double]
MZ = TASK.OutputArguments{1};
Y = TASK.OutputArguments{2};
destroy(JOB) % done retrieving the results
disp(sprintf('Parallel time (asynchronous) with 8 remote labs for %d spectro
grams: %f seconds',K,toc))
Parallel time (asynchronous) with 8 remote labs for 216 spectrograms: 96.969 721 seconds
Postprocessing
After collecting all the data, you can use it locally. For example, you can apply group normalization:
Y = msnorm(MZ,Y,'QUANTILE',0.5,'LIMITS',[3500 11000],'MAX',50);
Create a grouping vector with the type for each spectrogram as well as indexing vectors. This "labelling" will aid to perform further analysis on the data set.
grp = [repmat({'Cancer'},size(cancerFiles));...
repmat({'Normal'},size(normalFiles))];
cancerIdx = strmatch('Cancer',grp);
numel(cancerIdx) % number of files in the "Cancer" subdirectory
ans = 121
normalIdx = strmatch('Normal',grp); numel(normalIdx) % number of files in the "Normal" subdirectory
ans =
95
Once the data is labelled, you can display some spectrograms of each class using a different color (the first five of each group in this example).
h = plot(MZ,Y(:,cancerIdx(1:5)),'b',MZ,Y(:,normalIdx(1:5)),'r'); axis([7650 8200 -2 50]) xlabel('Mass/Charge (M/Z)');ylabel('Relative Intensity') legend(h([1 end]),{'Ovarian Cancer','Control'}) title('Region of the pre-processed spectrograms')
Save the preprocessed data set, because it will be used in the demos Identifying Significant Features and Classifying Protein Profiles and Genetic Algorithm Search for Features in Mass Spectrometry Data.
save OvarianCancerQAQCdataset.mat Y MZ grp
Disclaimer
TIC - TOC timing is presented here as an example. The sequential and parallel execution time will vary depending on the hardware you use. In this demo, we used an AMD® Athlon™ Dual-core (2.4GHz, 2GB, win32) as the local machine and a cluster of eight workers in four AMD Athlon™ Dual-core (2.4GHz, 2GB, Linux®) machines. The job manager also runs on one of the AMD computers.
Store