Bioinformatics Toolbox 3.4
Genetic Algorithm Search for Features in Mass Spectrometry Data
This demonstration illustrates the use of the Genetic Algorithm and Direct Search Toolbox™ with the Bioinformatics Toolbox™ to optimize the search for features to classify mass spectrometry (SELDI) data.
Contents
Introduction
Genetic algorithms optimize search results for problems with large data sets. You can use the MATLAB® genetic algorithm function to solve these problems in Bioinformatics. Genetic algorithms have been applied to phylogenetic tree building, gene expression and mass spectrometry data analysis, and many other areas of Bioinformatics that have large and computationally expensive problems. This demonstration searches for optimal features (peaks) in mass spectrometry data. We will look for specific peaks in the data that distinguish cancer patients from control patients.
Genetic Algorithm and Directed Search Toolbox
First familiarize yourself with the Genetic Algorithm and Direct Search Toolbox. The documentation describes how a genetic algorithm works and how to use it in MATLAB. To access the documentation, use the doc command.
doc gads
Load Mass Spectrometry Data into MATLAB®
The data in this demo is also used in cancerdetectdemo and is available from the FDA-NCI Clinical Proteomics Program Databank at http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp. It is a collection of samples from 121 ovarian cancer patients and 95 control patients. This demonstration assumes that you downloaded, uncompressed, and preprocessed the raw mass-spectrometry data from the FDA-NCI Web site. You can recreate the preprocessed data file needed for this demo, OvarianCancerQAQCdataset.mat, by running the script msseqprocessing, or by following the steps in the demo biodistcompdemo (Batch processing Through Distributed Computing). There should be two variables that are loaded into MATLAB (MZ and Y). MZ is the mass/charge vector while Y is the intensity values for all 216 patients (control and cancer). To visualize this data see the cancerdetectdemo.
load OvarianCancerQAQCdataset
whos
Name Size Bytes Class Attributes MZ 15000x1 120000 double Y 15000x216 25920000 double ans 1x1 8 double grp 216x1 26784 cell
Initialize the variables used in the demonstration.
[numPoints numSamples] = size(Y); % total number of samples and data points id = grp2idx(grp); % ground truth: Cancer=1, Control=2
Create a Fitness Function for the Genetic Algorithm
A genetic algorithm requires an objective function, also known as the fitness function, which describes the phenomenon that we want to optimize. In this demonstration, the genetic algorithm machinery tests small subsets of M/Z values using the fitness function and then determines which M/Z values get passed on to or removed from each subsequent generation. The fitness function biogafit is passed to the Genetic Algorithm and Direct Search Toolbox using a function handle. In this example, biogafit maximizes the separability of two classes by using a linear combination of 1) the a-posteriori probability and 2) the empirical error rate of a linear classifier (classify). You can create your own fitness function to try different classifiers or alternative methods for assessing the performance of the classifiers.
type biogafit.m
function classPerformance = biogafit(thePopulation,Y,id) %BIOGAFIT The fitness function for BIOGAMSDEMO % % This function uses the classify function to measure how well mass % spectrometry data is grouped using certain masses. The input argument % thePopulation is a vector of row indices from the mass spectrometry % data Y. Classification performance is a linear combination of the error % rate and the posteriori probability of the classifier. % Copyright 2003-2005 The MathWorks, Inc. % $Revision: 1.1.6.2 $ $Date: 2006/09/27 00:18:16 $ thePopulation = round(thePopulation); try [c,e,p] = classify(Y(thePopulation,:)',Y(thePopulation,:)',double(id)); cp = classperf(id,c); classPerformance = 100*cp.ErrorRate + 1 - mean(max(p,[],2)); catch classPerformance = Inf; end
Create an Initial Population
Users can change how the optimization is performed by the genetic algorithm by creating custom functions for crossover, fitness scaling, mutation, selection, and population creation. In this demo, you will use the biogacreate function written for this demo to create initial random data points from the mass spectrometry data. The function header requires specific input parameters as specified by the GA documentation. There is a default creation function in the toolbox for creating initial populations of data points.
type biogacreate.m
function pop = biogacreate(GenomeLength,FitnessFcn,options,Y,id) %BIOGACREATE Population creation function for MSGADEMO % % This function creates a population matrix with dimensions of % options.PopulationSize rows by the number of independent variables % (GenomeLength) columns. These values are integers that correspond to % randomly selected rows of the mass spectrometry data Y. Each row of the % population matrix is a random sample of row indices of the mass spec % data. % Note: This function's input arguments are required by the GA function. % See GA documentation for further detail. % Copyright 2005 The MathWorks, Inc. % $Revision: 1.1.6.2 $ $Date: 2005/08/15 14:49:34 $ ranked_features = rankfeatures(Y,id); pop = zeros(options.PopulationSize,GenomeLength); pop(:) = randsample(ranked_features(1:numel(pop)),numel(pop));
Set Genetic Algorithm Options
The GA function uses an options structure to hold the algorithm parameters that it uses when performing a minimization with a genetic algorithm. The gaoptimset function will create this options structure. For demonstration purposes the genetic algorithm will run only for 50 generations. However, you may set 'Generations' to a larger value.
options = gaoptimset('CreationFcn', {@biogacreate,Y,id},... 'PopulationSize',100,... 'Generations',50,... 'Display', 'iter')
options =
PopulationType: []
PopInitRange: []
PopulationSize: 100
EliteCount: []
CrossoverFraction: []
ParetoFraction: []
MigrationDirection: []
MigrationInterval: []
MigrationFraction: []
Generations: 50
TimeLimit: []
FitnessLimit: []
StallGenLimit: []
StallTimeLimit: []
TolFun: []
TolCon: []
InitialPopulation: []
InitialScores: []
InitialPenalty: []
PenaltyFactor: []
PlotInterval: []
CreationFcn: {1x3 cell}
FitnessScalingFcn: []
SelectionFcn: []
CrossoverFcn: []
MutationFcn: []
DistanceMeasureFcn: []
HybridFcn: []
Display: 'iter'
PlotFcns: []
OutputFcns: []
Vectorized: []
UseParallel: []
Run GA to Find 20 Discriminative Features
Use ga to start the genetic algorithm function. 100 groups of 20 datapoints each will evolve over 50 generations. Selection, crossover, and mutation events generate a new population in every generation.
% initialize the random generators to the same state used to generate the % published example rand('seed',1) randn('seed',1) nVars = 20; % set the number of desired features FitnessFcn = {@biogafit,Y,id}; % set the fitness function feat = ga(FitnessFcn,nVars,options); % call the Genetic Algorithm feat = round(feat); Significant_Masses = MZ(feat) cp = classperf(classify(Y(feat,:)',Y(feat,:)',id),id); cp.CorrectRate
Best Mean Stall
Generation f-count f(x) f(x) Generations
1 200 1.883 Inf 0
2 300 1.875 Inf 0
3 400 0.9538 Inf 0
4 500 0.9538 Inf 1
5 600 0.9485 Inf 0
6 700 0.9393 Inf 0
7 800 0.4769 Inf 0
8 900 0.01278 Inf 0
9 1000 0.01278 Inf 1
10 1100 0.01224 Inf 0
11 1200 0.01203 Inf 0
12 1300 0.01203 Inf 1
13 1400 0.01124 Inf 0
14 1500 0.009127 Inf 0
15 1600 0.009127 Inf 1
16 1700 0.007756 Inf 0
17 1800 0.007068 Inf 0
18 1900 0.006344 Inf 0
19 2000 0.006344 Inf 1
20 2100 0.006131 Inf 0
21 2200 0.006131 Inf 1
22 2300 0.005432 Inf 0
23 2400 0.005432 Inf 1
24 2500 0.004906 Inf 0
25 2600 0.003587 Inf 0
26 2700 0.003587 Inf 1
27 2800 0.003587 Inf 2
28 2900 0.003587 Inf 3
29 3000 0.003342 0.06646 0
30 3100 0.003017 0.04306 0
Best Mean Stall
Generation f-count f(x) f(x) Generations
31 3200 0.003017 0.02395 1
32 3300 0.003017 0.01873 2
33 3400 0.002982 0.01379 0
34 3500 0.002939 0.01358 0
35 3600 0.002907 0.004136 0
36 3700 0.002783 0.003819 0
37 3800 0.002783 0.003717 1
38 3900 0.002783 0.00833 2
39 4000 0.002783 0.0036 3
40 4100 0.002778 0.003471 0
41 4200 0.002778 0.003285 1
42 4300 0.002778 0.003159 2
43 4400 0.002778 0.003203 3
44 4500 0.002778 0.003137 4
45 4600 0.002658 0.003079 0
46 4700 0.002658 0.003062 1
47 4800 0.002658 0.002958 2
48 4900 0.002658 0.002845 3
49 5000 0.002658 0.002746 4
50 5100 0.002658 0.0027 5
Optimization terminated: maximum number of generations exceeded.
Significant_Masses =
1.0e+003 *
3.5966
7.0455
7.0592
7.5483
7.7336
8.9152
7.7517
8.6322
7.9563
8.7082
8.9225
8.5194
7.3064
8.0939
7.7230
4.3171
7.3461
8.6171
7.1895
7.7464
ans =
1
Display the Features that are Discriminatory
To visualize which features have been selected by the genetic algorithm, the data is plotted with peak positions marked with red vertical lines.
xAxisLabel = 'Mass/Charge (M/Z)'; % x label for plots yAxisLabel = 'Relative Ion Intensity'; % y label for plots figure; hold on; hC = plot(MZ,Y(:,1:15) ,'b'); hN = plot(MZ,Y(:,141:155),'g'); hG = plot(MZ(feat(ceil((1:60 )/3))), repmat([0 100 NaN],1,20),'r'); xlabel(xAxisLabel); ylabel(yAxisLabel); axis([1900 12000 -1 40]); legend([hN(1),hC(1),hG(1)],{'Control','Ovarian Cancer', 'Discriminative Features'},2); title('Mass Spectrometry Data with Discriminative Features found by Genetic Algorithm');
Observe the interesting peak around 8100 Da., which seems to be shifted to the right on healthy samples.
axis([8082 8128 -.5 4])
Store