Note: This page has been translated by MathWorks. Please click here

To view all translated materals including this page, select Japan from the country navigator on the bottom of this page.

To view all translated materals including this page, select Japan from the country navigator on the bottom of this page.

This example demonstrates a number of ways to look for patterns in gene expression profiles, using gene expression data from yeast shifting from fermentation to respiration.

The microarray data for this example is from DeRisi, J.L., Iyer,
V.R., and Brown, P.O. (Oct 24, 1997). Exploring the metabolic and
genetic control of gene expression on a genomic scale. Science, *278
(5338)*, 680–686. PMID: 9381177.

The authors used DNA microarrays to study temporal gene expression
of almost all genes in *Saccharomyces cerevisiae* during
the metabolic shift from fermentation to respiration. Expression levels
were measured at seven time points during the diauxic shift. The full
data set can be downloaded from the Gene Expression Omnibus Web site
at:

This procedure illustrates how to import data from the Web into
the MATLAB^{®} environment. The data for this procedure is available
in the MAT-file `yeastdata.mat`

. This file contains
the VALUE data or LOG_RAT2N_MEAN, or log2 of ratio of CH2DN_MEAN and
CH1DN_MEAN from the seven time steps in the experiment, the names
of the genes, and an array of the times at which the expression levels
were measured.

Load data into the MATLAB environment.

`load yeastdata.mat`

Get the size of the data by typing

numel(genes)

The number of genes in the data set displays in the MATLAB Command Window. The MATLAB variable

`genes`

is a cell array of the gene names.ans = 6400

Access the entries using cell array indexing.

genes{15}

This displays the 15th row of the variable

`yeastvalues`

, which contains expression levels for the open reading frame (ORF) YAL054C.ans = YAL054C

Use the function

`web`

to access information about this ORF in the Saccharomyces Genome Database (SGD).url = sprintf(... 'http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=%s',... genes{15}); web(url);

A simple plot can be used to show the expression profile for this ORF.

plot(times, yeastvalues(15,:)) xlabel('Time (Hours)'); ylabel('Log2 Relative Expression Level');

The MATLAB software plots the figure. The values are log

_{2}ratios.Plot the actual values.

plot(times, 2.^yeastvalues(15,:)) xlabel('Time (Hours)'); ylabel('Relative Expression Level');

The MATLAB software plots the figure. The gene associated with this ORF, ACS1, appears to be strongly up-regulated during the diauxic shift.

Compare other genes by plotting multiple lines on the same figure.

hold on plot(times, 2.^yeastvalues(16:26,:)') xlabel('Time (Hours)'); ylabel('Relative Expression Level'); title('Profile Expression Levels');

The MATLAB software plots the image.

This procedure illustrates how to filter the data by removing
genes that are not expressed or do not change. The data set is quite
large and a lot of the information corresponds to genes that do not
show any interesting changes during the experiment. To make it easier
to find the interesting genes, reduce the size of the data set by
removing genes with expression profiles that do not show anything
of interest. There are `6400`

expression profiles.
You can use a number of techniques to reduce the number of expression
profiles to some subset that contains the most significant genes.

If you look through the gene list you will see several spots marked as

`'EMPTY'`

. These are empty spots on the array, and while they might have data associated with them, for the purposes of this example, you can consider these points to be noise. These points can be found using the`strcmp`

function and removed from the data set with indexing commands.`emptySpots = strcmp('EMPTY',genes); yeastvalues(emptySpots,:) = []; genes(emptySpots) = []; numel(genes)`

The MATLAB software displays:

ans = 6314

In the

`yeastvalues`

data you will also see several places where the expression level is marked as`NaN`

. This indicates that no data was collected for this spot at the particular time step. One approach to dealing with these missing values would be to impute them using the mean or median of data for the particular gene over time. This example uses a less rigorous approach of simply throwing away the data for any genes where one or more expression levels were not measured.Use the

`isnan`

function to identify the genes with missing data and then use indexing commands to remove the genes.nanIndices = any(isnan(yeastvalues),2); yeastvalues(nanIndices,:) = []; genes(nanIndices) = []; numel(genes)

The MATLAB software displays:

ans = 6276

If you were to plot the expression profiles of all the remaining profiles, you would see that most profiles are flat and not significantly different from the others. This flat data is obviously of use as it indicates that the genes associated with these profiles are not significantly affected by the diauxic shift. However, in this example, you are interested in the genes with large changes in expression accompanying the diauxic shift. You can use filtering functions in the toolbox to remove genes with various types of profiles that do not provide useful information about genes affected by the metabolic change.

Use the function

`genevarfilter`

to filter out genes with small variance over time. The function returns a logical array of the same size as the variable`genes`

with ones corresponding to rows of`yeastvalues`

with variance greater than the`10`

th percentile and zeros corresponding to those below the threshold.mask = genevarfilter(yeastvalues); % Use the mask as an index into the values to remove the % filtered genes. yeastvalues = yeastvalues(mask,:); genes = genes(mask); numel(genes)

The MATLAB software displays:

ans = 5648

The function

`genelowvalfilter`

removes genes that have very low absolute expression values. Note that the gene filter functions can also automatically calculate the filtered data and names.[mask, yeastvalues, genes] = genelowvalfilter(yeastvalues,genes,... 'absval',log2(4)); numel(genes)

The MATLAB software displays:

ans = 423

Use the function

`geneentropyfilter`

to remove genes whose profiles have low entropy:[mask, yeastvalues, genes] = geneentropyfilter(yeastvalues,genes,... 'prctile',15); numel(genes)

The MATLAB software displays:

ans = 310

Now that you have a manageable list of genes, you can look for relationships between the profiles using some different clustering techniques from the Statistics and Machine Learning Toolbox™ software.

For hierarchical clustering, the function

`pdist`

calculates the pairwise distances between profiles, and the function`linkage`

creates the hierarchical cluster tree.corrDist = pdist(yeastvalues, 'corr'); clusterTree = linkage(corrDist, 'average');

The function

`cluster`

calculates the clusters based on either a cutoff distance or a maximum number of clusters. In this case, the`'maxclust'`

option is used to identify`16`

distinct clusters.`clusters = cluster(clusterTree, 'maxclust', 16);`

The profiles of the genes in these clusters can be plotted together using a simple loop and the function

`subplot`

.figure for c = 1:16 subplot(4,4,c); plot(times,yeastvalues((clusters == c),:)'); axis tight end suptitle('Hierarchical Clustering of Profiles');

The MATLAB software plots the images.

The Statistics and Machine Learning Toolbox software also has a K-means clustering function. Again, 16 clusters are found, but because the algorithm is different these are not necessarily the same clusters as those found by hierarchical clustering.

[cidx, ctrs] = kmeans(yeastvalues, 16,... 'dist','corr',... 'rep',5,... 'disp','final'); figure for c = 1:16 subplot(4,4,c); plot(times,yeastvalues((cidx == c),:)'); axis tight end suptitle('K-Means Clustering of Profiles');

The MATLAB software displays:

13 iterations, total sum of distances = 11.4042 14 iterations, total sum of distances = 8.62674 26 iterations, total sum of distances = 8.86066 22 iterations, total sum of distances = 9.77676 26 iterations, total sum of distances = 9.01035

Instead of plotting all of the profiles, you can plot just the centroids.

figure for c = 1:16 subplot(4,4,c); plot(times,ctrs(c,:)'); axis tight axis off % turn off the axis end suptitle('K-Means Clustering of Profiles');

The MATLAB software plots the figure:

You can use the function

`clustergram`

to create a heat map and dendrogram from the output of the hierarchical clustering.figure clustergram(yeastvalues(:,2:end),'RowLabels',genes,... 'ColumnLabels',times(2:end))

The MATLAB software plots the figure:

Principal-component analysis (PCA) is a useful technique you can use to reduce the dimensionality of large data sets, such as those from microarray analysis. You can also use PCA to find signals in noisy data.

Use the

`pca`

function in the Statistics and Machine Learning Toolbox software to calculate the principal components of a data set.[pc, zscores, pcvars] = pca(yeastvalues)

The MATLAB software displays:

pc = Columns 1 through 4 -0.0245 -0.3033 -0.1710 -0.2831 0.0186 -0.5309 -0.3843 -0.5419 0.0713 -0.1970 0.2493 0.4042 0.2254 -0.2941 0.1667 0.1705 0.2950 -0.6422 0.1415 0.3358 0.6596 0.1788 0.5155 -0.5032 0.6490 0.2377 -0.6689 0.2601 Columns 5 through 7 -0.1155 0.4034 0.7887 -0.2384 -0.2903 -0.3679 -0.7452 -0.3657 0.2035 -0.2385 0.7520 -0.4283 0.5592 -0.2110 0.1032 -0.0194 -0.0961 0.0667 -0.0673 -0.0039 0.0521

You can use the function

`cumsum`

to see the cumulative sum of the variances.cumsum(pcvars./sum(pcvars) * 100)

The MATLAB software displays:

ans = 78.3719 89.2140 93.4357 96.0831 98.3283 99.3203 100.0000

This shows that almost 90% of the variance is accounted for by the first two principal components.

A scatter plot of the scores of the first two principal components shows that there are two distinct regions. This is not unexpected, because the filtering process removed many of the genes with low variance or low information. These genes would have appeared in the middle of the scatter plot.

figure scatter(zscores(:,1),zscores(:,2)); xlabel('First Principal Component'); ylabel('Second Principal Component'); title('Principal Component Scatter Plot');

The MATLAB software plots the figure:

The

`gname`

function from the Statistics and Machine Learning Toolbox software can be used to identify genes on a scatter plot. You can select as many points as you like on the scatter plot.gname(genes);

When you have finished selecting points, press

**Enter**.An alternative way to create a scatter plot is with the

`gscatter`

function from the Statistics and Machine Learning Toolbox software.`gscatter`

creates a grouped scatter plot where points from each group have a different color or marker. You can use`clusterdata`

, or any other clustering function, to group the points.figure pcclusters = clusterdata(zscores(:,1:2),6); gscatter(zscores(:,1),zscores(:,2),pcclusters) xlabel('First Principal Component'); ylabel('Second Principal Component'); title('Principal Component Scatter Plot with Colored Clusters'); gname(genes) % Press enter when you finish selecting genes.

The MATLAB software plots the figure:

Was this topic helpful?