clustergram

Compute hierarchical clustering, display dendrogram and heat map, and create clustergram object

Syntax

CGobj = clustergram(Data)

CGobj = clustergram(Data, ...'RowLabels', RowLabelsValue, ...)
CGobj = clustergram(Data, ...'ColumnLabels', ColumnLabelsValue, ...)
CGobj = clustergram(Data, ...'Standardize', StandardizeValue, ...)
CGobj = clustergram(Data, ...'Cluster', ClusterValue, ...)
CGobj = clustergram(Data, ...'RowPDist', RowPDistValue, ...)
CGobj = clustergram(Data, ...'ColumnPDist', ColumnPDistValue, ...)
CGobj = clustergram(Data, ...'Linkage', LinkageValue, ...)
CGobj = clustergram(Data, ...'Dendrogram', DendrogramValue, ...)
CGobj = clustergram(Data, ...'OptimalLeafOrder', OptimalLeafOrderValue, ...)
CGobj = clustergram(Data, ...'Colormap', ColormapValue, ...)
CGobj = clustergram(Data, ...'DisplayRange', DisplayRangeValue, ...)
CGobj = clustergram(Data, ...'Symmetric', SymmetricValue, ...)
CGobj = clustergram(Data, ...'LogTrans', LogTransValue, ...)
CGobj = clustergram(Data, ...'DisplayRatio', DisplayRatioValue, ...)
CGobj = clustergram(Data, ...'ImputeFun', ImputeFunValue, ...)
CGobj = clustergram(Data, ...'RowGroupMarker', RowGroupMarkerValue, ...)
CGobj = clustergram(Data, ...'ColumnGroupMarker', ColumnGroupMarkerValue, ...)

Arguments

Data

DataMatrix object or numeric matrix of data. If the matrix contains gene expression data, typically each row corresponds to a gene and each column corresponds to a sample.

RowLabelsValue

Vector of numbers or cell array of text strings to label the rows in the dendrogram and heat map. Default is a vector of values 1 through M, where M is the number of rows in Data.

    Note:   If the number of row labels is 200 or more, the labels do not appear in the clustergram plot unless you zoom in on the plot.

ColumnLabelsValue

Vector of numbers or cell array of text strings to label the columns in the dendrogram and heat map. Default is a vector of values 1 through N, where N is the number of columns in Data.

    Note:   If the number of column labels is 200 or more, the labels do not appear in the clustergram plot unless you zoom in on the plot.

StandardizeValue

String or number specifying the dimension for standardizing the values in Data. The clustergram function transforms the values so that, in the specified dimension, the mean is 0 and the standard deviation is 1. Choices are:

  • 'column' or 1 — Standardize along the columns of data.

  • 'row' or 2 (default) — Standardize along the rows of data.

  • 'none' or 3 — Do not standardize.

ClusterValue

String or number specifying the dimension for clustering the values in Data. Choices are:

  • 'column' or 1 — Cluster along the columns of data only, which results in clustered rows.

  • 'row' or 2 — Cluster along the rows of data only, which results in clustered columns.

  • 'all' or 3 (default) — Cluster along the columns of data, then cluster along the rows of row-clustered data.

RowPDistValue

String, function handle, or cell array specifying the distance metric to pass to the pdist function (Statistics Toolbox™ software) to calculate the pairwise distances between rows. For information on choices, see the pdist function. Default is 'euclidean'.

    Note:   If the distance metric requires extra arguments, then RowPDistValue is a cell array. For example, to use the Minkowski distance with exponent P, you would use {'minkowski', P}.

ColumnPDistValue

String, function handle, or cell array specifying the distance metric to pass to the pdist function (Statistics Toolbox software) to calculate the pairwise distances between columns. For information on choices, see the pdist function. Default is 'euclidean'.

    Note:   If the distance metric requires extra arguments, then ColumnPDistValue is a cell array. For example, to use the Minkowski distance with exponent P, you would use {'minkowski', P}.

LinkageValue

String or two-element cell array of strings specifying the linkage method to pass to the linkage function (Statistics Toolbox software) to create the hierarchical cluster tree for rows and columns. If a two-element cell array of strings, the clustergram function uses the first element for linkage between rows, and the second element for linkage between columns. For information on choices, see the linkage function. Default is 'average'.

    Tip   To specify the linkage method for only one dimension, set the other dimension to ''.

DendrogramValue

Scalar or two-element numeric vector or cell array of strings specifying the 'colorthreshold' property to pass to the dendrogram function (Statistics Toolbox software) to create the dendrogram plot. If a two-element numeric vector or cell array, the first element is for the rows, and the second element is for the columns. For more information, see the dendrogram function.

    Tip   To specify the 'colorthreshold' property for only one dimension, set the other dimension to ''.

OptimalLeafOrderValue

Enables or disables the optimal leaf ordering calculation, which determines the leaf order that maximizes the similarity between neighboring leaves. Choices are true (enable) or false (disable). Default depends on the size of Data. If the number of rows or columns in Data exceeds 1500, default is false; otherwise, default is true.

    Note:   Disabling the optimal leaf ordering calculation can be useful when working with large data sets, because this calculation consumes a lot of memory and time.

ColormapValue

Either of the following:

  • M-by-3 matrix of RGB values

  • Name of or handle to a function that returns a colormap, such as redgreencmap or redbluecmap

Default is redgreencmap, in which red represents values above the mean, black represents the mean, and green represents values below the mean of a row (gene) across all columns (samples).

DisplayRangeValue

Positive scalar that specifies the display range of standardized values. Default is 3, which means there is a color variation for values between –3 and 3, but values >3 are the same color as 3, and values < –3 are the same color as –3.

For example, if you specify redgreencmap for the 'Colormap' property, pure red represents values ≥ DisplayRangeValue, and pure green represents values ≤ –DisplayRangeValue.

SymmetricValue

Forces the color scale of the heat map to be symmetric around zero. Choices are true (default) or false.

LogTransValue

Controls the log2 transform of Data from natural scale. Choices are true or false (default).

DisplayRatioValue

Either of the following:

  • Scalar

  • Two-element vector

This property specifies the ratio of space that the row and column dendrograms occupy relative to the heat map. If DisplayRatioValue is a scalar, the clustergram function uses it as the ratio for both dendrograms. If DisplayRatioValue is a two-element vector, the clustergram function uses the first element for the ratio of the row dendrogram width to the heat map width, and the second element for the ratio of the column dendrogram height to the heat map height. The clustergram function ignores the second element for one-dimensional clustergrams. Default is 1/5.

ImputeFunValue

One of the following:

  • Name of a function that imputes missing data.

  • Handle to a function that imputes missing data.

  • Cell array where the first element is the name of or handle to a function that imputes missing data. The remaining elements are property name/property value pairs used as inputs to the function.

    Caution   If data points are missing, use the 'ImputeFun' property. Otherwise, the clustergram function errors.

RowGroupMarkerValue

Structure or structure array containing information for annotating the groups (clusters) of rows determined by the clustergram function. The structure or structures contain the following fields. If a single structure, then the fields contain a cell array of elements. If a structure array, then the fields contain a single element.

  • GroupNumber — Scalar specifying the row group number to annotate.

  • Annotation — String specifying text to annotate the row group.

  • Color — String or three-element vector of RGB values specifying a color, which the clustergram function uses to label the row group. For more information on specifying colors, see ColorSpec. If this field is empty, default is 'blue'.

ColumnGroupMarkerValue

Structure or structure array containing information for annotating the groups (clusters) of columns determined by the clustergram function. The structure or structures contain the following fields. If a single structure, then the fields contain a cell array of elements. If a structure array, then the fields contain a single element.

  • GroupNumber — Scalar specifying the column group number to annotate.

  • Annotation — String specifying text to annotate the column group.

  • Color — String or three-element vector of RGB values specifying a color, which the clustergram function uses to label the column group. For more information on specifying colors, see ColorSpec. If this field is empty, default is 'blue'.
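
The two layouts described for RowGroupMarkerValue and ColumnGroupMarkerValue can be built with the struct function. The following is a minimal sketch that uses made-up group numbers, annotations, and colors; it only shows how the two layouts differ and does not call clustergram.

```matlab
% Structure array: one element per group, each field holds a single value.
% Passing a cell array to struct() creates one array element per cell.
markersArray = struct('GroupNumber', {1, 2}, ...
                      'Annotation',  {'A', 'B'}, ...
                      'Color',       {'b', 'm'});

% Single structure: one element, each field holds a cell array of values.
% The extra pair of braces stops struct() from expanding the cell array.
markersSingle = struct('GroupNumber', {{1, 2}}, ...
                       'Annotation',  {{'A', 'B'}}, ...
                       'Color',       {{'b', 'm'}});

% markersArray is 1-by-2; markersSingle is 1-by-1 with cell-array fields.
nArray  = numel(markersArray);
nSingle = numel(markersSingle);
```

Either variable could then be supplied as the value of 'RowGroupMarker' or 'ColumnGroupMarker', provided the group numbers exist in the clustergram being annotated.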

Description

CGobj = clustergram(Data) performs hierarchical clustering analysis on the values in Data, a DataMatrix object or numeric matrix. It creates CGobj, an object containing the analysis data, and displays a dendrogram and heat map. It uses hierarchical clustering with Euclidean distance metric and average linkage to generate the hierarchical tree. It clusters first along the columns (producing row-clustered data), and then along the rows in the matrix Data. If Data contains gene expression data, typically the rows correspond to genes and the columns correspond to samples.
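
The default analysis described above can be approximated for one dimension with the Statistics Toolbox functions named on this page. The following is a minimal sketch, not the clustergram implementation: the matrix X is made-up data, the variable names are illustrative, and only the rows are clustered.

```matlab
% Made-up data: 6 observations (rows) by 4 variables (columns).
X = [1.0 2.0 3.0 4.0;
     1.1 2.1 2.9 4.2;
     9.0 8.0 7.0 6.0;
     8.8 8.1 7.2 5.9;
     0.9 1.8 3.1 3.8;
     9.2 7.9 6.8 6.1];

% Pairwise Euclidean distances between rows, as a condensed vector.
D = pdist(X, 'euclidean');

% Hierarchical cluster tree built with average linkage.
tree = linkage(D, 'average');

% Plot the tree with all leaves shown (second argument 0) and return the
% order in which the leaves appear along the axis.
[~, ~, leafOrder] = dendrogram(tree, 0);

% Reorder the rows so that rows joined early in the tree sit together.
Xordered = X(leafOrder, :);
```

Repeating the same steps on the transpose of X would give a corresponding ordering for the columns.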

CGobj = clustergram(Data, ...'PropertyName', PropertyValue, ...) calls clustergram with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Enclose each PropertyName in single quotation marks. Each PropertyName is case insensitive. These property name/property value pairs are as follows:


CGobj = clustergram(Data, ...'RowLabels', RowLabelsValue, ...) uses the contents of RowLabelsValue, a vector of numbers or cell array of text strings, as labels for the rows in the dendrogram and heat map. Default is a vector of values 1 through M, where M is the number of rows in Data.

CGobj = clustergram(Data, ...'ColumnLabels', ColumnLabelsValue, ...) uses the contents of ColumnLabelsValue, a vector of numbers or cell array of text strings, as labels for the columns in the dendrogram and heat map. Default is a vector of values 1 through N, where N is the number of columns in Data.

CGobj = clustergram(Data, ...'Standardize', StandardizeValue, ...) specifies the dimension for standardizing the values in Data. The clustergram function transforms the values so that, in the specified dimension, the mean is 0 and the standard deviation is 1. StandardizeValue can be:

  • 'column' or 1 — Standardize along the columns of data.

  • 'row' or 2 (default) — Standardize along the rows of data.

  • 'none' or 3 — Do not standardize.
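
To make the transform concrete, the following sketch standardizes a small made-up matrix by hand so that each row has mean 0 and standard deviation 1. It illustrates the arithmetic only; the normalization by N-1 (the default of std) is an assumption, and the code is not taken from clustergram.

```matlab
% Made-up 3-by-4 matrix (rows = genes, columns = samples).
X = [ 1  2  3  4;
     10 20 30 40;
      5  5  6  8];

% Mean and standard deviation of each row (dimension 2 = across columns).
mu    = mean(X, 2);
sigma = std(X, 0, 2);          % 0 selects normalization by N-1

% Subtract the row mean and divide by the row standard deviation.
Z = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

% Check: every row of Z has mean 0 and standard deviation 1.
rowMeans = mean(Z, 2);
rowStds  = std(Z, 0, 2);
```

Because the first two rows of X differ only by a scale factor, they become identical after this transform, which is why standardizing can change which rows cluster together.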

CGobj = clustergram(Data, ...'Cluster', ClusterValue, ...) specifies the dimension for clustering the values in Data. ClusterValue can be:

  • 'column' or 1 — Cluster along the columns of data only, which results in clustered rows.

  • 'row' or 2 — Cluster along the rows of data only, which results in clustered columns.

  • 'all' or 3 (default) — Cluster along the columns of data, then cluster along the rows of row-clustered data.

CGobj = clustergram(Data, ...'RowPDist', RowPDistValue, ...) specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to calculate the pairwise distances between rows. RowPDistValue is a string, function handle, or cell array. For information on choices, see the pdist function. Default is 'euclidean'.

CGobj = clustergram(Data, ...'ColumnPDist', ColumnPDistValue, ...) specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to calculate the pairwise distances between columns. ColumnPDistValue is a string, function handle, or cell array. For information on choices, see the pdist function. Default is 'euclidean'.

    Note:   If the distance metric requires extra arguments, then RowPDistValue or ColumnPDistValue is a cell array. For example, to use the Minkowski distance with exponent P, you would use {'minkowski', P}.
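
As an illustration of the cell array form, the following sketch passes a Minkowski metric with an extra argument for the rows and a metric without extra arguments for the columns. The data matrix and the exponent value are made up, and the direct pdist call is shown only for comparison; how clustergram forwards the cell array internally is an assumption.

```matlab
% Made-up data: 20 rows by 6 columns of random values.
rng(0);                          % fix the seed so the data is reproducible
data = randn(20, 6);

P = 3;                           % illustrative Minkowski exponent

% Direct call to pdist with the same metric and extra argument.
rowDistances = pdist(data, 'minkowski', P);

% Cell array form for rows; plain string form for columns.
cg = clustergram(data, ...
    'RowPDist',    {'minkowski', P}, ...
    'ColumnPDist', 'cityblock');
```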

CGobj = clustergram(Data, ...'Linkage', LinkageValue, ...) specifies the linkage method to pass to the linkage function (Statistics Toolbox software) to create the hierarchical cluster tree for rows and columns. LinkageValue is a string or two-element cell array of strings. If a two-element cell array of strings, the clustergram function uses the first element for linkage between rows, and the second element for linkage between columns. For information on choices, see the linkage function. Default is 'average'.

    Tip   To specify the linkage method for only one dimension, set the other dimension to ''.
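
The following sketch shows both forms of LinkageValue on made-up random data: a single string applied to both dimensions, and a two-element cell array in which the empty string leaves the column linkage unspecified, as the tip describes.

```matlab
% Made-up data: 20 rows by 6 columns of random values.
rng(0);
data = randn(20, 6);

% One method for both rows and columns.
cgBoth = clustergram(data, 'Linkage', 'complete');

% Complete linkage for rows only; '' leaves the column method unspecified.
cgRowsOnly = clustergram(data, 'Linkage', {'complete', ''});
```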

CGobj = clustergram(Data, ...'Dendrogram', DendrogramValue, ...) specifies the 'colorthreshold' property to pass to the dendrogram function (Statistics Toolbox software) to create the dendrogram plot. DendrogramValue is a scalar or two-element numeric vector or cell array of strings that specifies the 'colorthreshold' property. If a two-element numeric vector or cell array, the first element is for the rows, and the second element is for the columns. For more information, see the dendrogram function.

    Tip   To specify the 'colorthreshold' property for only one dimension, set the other dimension to ''.

CGobj = clustergram(Data, ...'OptimalLeafOrder', OptimalLeafOrderValue, ...) enables or disables the optimal leaf ordering calculation, which determines the leaf order that maximizes the similarity between neighboring leaves. Choices are true (enable) or false (disable). Default depends on the size of Data. If the number of rows or columns in Data exceeds 1500, default is false; otherwise, default is true.

    Tip   Disabling the optimal leaf ordering calculation can be useful when working with large data sets, because this calculation consumes a lot of memory and time.
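
To show what leaf ordering operates on, the following sketch computes an optimal leaf order outside of clustergram with the optimalleaforder function. It assumes a Statistics Toolbox release that includes optimalleaforder and uses made-up random data; whether clustergram calls this function internally is not stated on this page.

```matlab
% Made-up data: 10 observations by 4 variables.
rng(0);
X = randn(10, 4);

% Distances and cluster tree, using the defaults named on this page.
D    = pdist(X, 'euclidean');
tree = linkage(D, 'average');

% Leaf order that places similar observations next to each other.
order = optimalleaforder(tree, D);

% Skipping the calculation in clustergram itself, for example for large data.
cg = clustergram(X, 'OptimalLeafOrder', false);
```

The returned order is a permutation of the observation indices, so it changes only how the tree is drawn and not the tree itself.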

CGobj = clustergram(Data, ...'Colormap', ColormapValue, ...) specifies the colormap to use to create the clustergram. The colormap controls the colors used to display the heat map. ColormapValue is either an M-by-3 matrix of RGB values or the name of or handle to a function that returns a colormap, such as redgreencmap or redbluecmap. Default is redgreencmap.

    Note:   In redgreencmap, red represents values above the mean, black represents the mean, and green represents values below the mean of a row (gene) across all columns (samples). In redbluecmap, red represents values above the mean, white represents the mean, and blue represents values below the mean of a row (gene) across all columns (samples).

CGobj = clustergram(Data, ...'DisplayRange', DisplayRangeValue, ...) specifies the display range of standardized values. DisplayRangeValue must be a positive scalar. Default is 3, which means there is a color variation for values between –3 and 3, but values >3 are the same color as 3, and values < –3 are the same color as –3.

For example, if you specify redgreencmap for the 'Colormap' property, pure red represents values ≥ DisplayRangeValue, and pure green represents values ≤ –DisplayRangeValue.
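
The saturation behavior can be sketched as a clipping step applied before colors are assigned. The values below are made up, and the code illustrates the rule described above rather than reproducing the clustergram internals.

```matlab
% Made-up standardized values and the default display range.
z            = [-5 -3 -1 0 2 3 4.5];
displayRange = 3;

% Values outside [-displayRange, displayRange] share the color of the limit,
% which is equivalent to clipping them to the limit.
shown = max(min(z, displayRange), -displayRange);
% shown is [-3 -3 -1 0 2 3 3]
```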

CGobj = clustergram(Data, ...'Symmetric', SymmetricValue, ...) controls whether the color scale of the heat map is symmetric around zero. SymmetricValue can be true (default) or false.

CGobj = clustergram(Data, ...'LogTrans', LogTransValue, ...) controls the log2 transform of Data from natural scale. Choices are true or false (default).

CGobj = clustergram(Data, ...'DisplayRatio', DisplayRatioValue, ...) specifies the ratio of space that the row and column dendrograms occupy relative to the heat map. If DisplayRatioValue is a scalar, the clustergram function uses it as the ratio for both dendrograms. If DisplayRatioValue is a two-element vector, the clustergram function uses the first element for the ratio of the row dendrogram width to the heat map width, and the second element for the ratio of the column dendrogram height to the heat map height. The clustergram function ignores the second element for one-dimensional clustergrams. Default is 1/5.
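
For instance, the following sketch gives the row dendrogram a tenth of the heat map width and the column dendrogram three tenths of the heat map height. The data and the ratio values are made up for illustration.

```matlab
% Made-up data: 20 rows by 6 columns of random values.
rng(0);
data = randn(20, 6);

% First element: row dendrogram width relative to the heat map width.
% Second element: column dendrogram height relative to the heat map height.
cg = clustergram(data, 'DisplayRatio', [0.1 0.3]);
```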

CGobj = clustergram(Data, ...'ImputeFun', ImputeFunValue, ...) specifies a function and optional inputs that impute missing data. ImputeFunValue can be any of the following:

  • Name of a function that imputes missing data.

  • Handle to a function that imputes missing data.

  • Cell array where the first element is the name of or handle to a function that imputes missing data. The remaining elements are property name/property value pairs used as inputs to the function.

    Tip   If data points are missing, use the 'ImputeFun' property. Otherwise, the clustergram function errors.
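
As one possible choice of imputation function, the following sketch uses knnimpute from Bioinformatics Toolbox on made-up data with a single missing value. The direct call shows what the function returns; the clustergram call passes the same function by handle.

```matlab
% Made-up data with one missing value.
rng(0);
data = randn(20, 6);
data(3, 2) = NaN;

% Direct call: returns a copy of data with the NaN replaced.
filled = knnimpute(data);

% Let clustergram impute the missing value before clustering.
cg = clustergram(data, 'ImputeFun', @knnimpute);
```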

CGobj = clustergram(Data, ...'RowGroupMarker', RowGroupMarkerValue, ...) specifies a structure or structure array containing information for annotating the groups (clusters) of rows determined by the clustergram function.

CGobj = clustergram(Data, ...'ColumnGroupMarker', ColumnGroupMarkerValue, ...) specifies a structure or structure array containing information for annotating the groups of columns determined by the clustergram function.

    Tip   If necessary, view row labels (right) and column labels (bottom) by clicking the Zoom In button on the toolbar to zoom the clustergram.

Examples

The following example uses data from an experiment (DeRisi et al., 1997) that used DNA microarrays to study temporal gene expression of almost all genes in Saccharomyces cerevisiae (yeast) during the metabolic shift from fermentation to respiration. Expression levels were measured at seven time points during the diauxic shift.

  1. Load the MAT-file, provided with Bioinformatics Toolbox™, that contains filtered yeast data.

    load filteredyeastdata

    This MAT-file includes three variables, which are added to the MATLAB® Workspace:

    • yeastvalues — A matrix of gene expression data from Saccharomyces cerevisiae (yeast) during the metabolic shift from fermentation to respiration

    • genes — A cell array of GenBank® accession numbers for labeling the rows in yeastvalues

    • times — A vector of time values for labeling the columns in yeastvalues

  2. Create a clustergram object from the gene expression data in the first 30 rows of the yeastvalues matrix, standardizing along the rows of data, and display the dendrograms and heat map.

    cgo = clustergram(yeastvalues(1:30,:),'Standardize','Row')
    Clustergram object with 30 rows of nodes and 7 columns of nodes.

  3. Use the set method and the genes and times vectors to add meaningful row and column labels to the clustergram.

    set(cgo,'RowLabels',genes(1:30),'ColumnLabels',times)

  4. Add a color bar to the clustergram by clicking the Insert Colorbar button on the toolbar.

  5. View a data tip containing the intensity value, row label, and column label for a specific area of the heat map by clicking the Data Cursor button on the toolbar, then clicking an area in the heat map. To delete this data tip, right-click it, then select Delete Current Datatip.

  6. Display intensity values for each area of the heat map by clicking the Annotate button on the toolbar. Click the Annotate button again to remove the intensity values.

      Tip   If the amount of data is large enough, the cells within the clustergram are too small to display the intensity annotations. Zoom the clustergram to see the intensity annotations.

  7. Remove the dendrogram tree diagrams from the figure by clicking the Show Dendrogram button on the toolbar. Click the Show Dendrogram button again to display the dendrograms.

  8. Use the get method to display the properties of the clustergram object, cgo:

    get(cgo)
    
                     Cluster: 'ALL'
                    RowPDist: {'Euclidean'}
                 ColumnPDist: {'Euclidean'}
                     Linkage: {'Average'}
                  Dendrogram: {}
            OptimalLeafOrder: 1
                    LogTrans: 0
                DisplayRatio: [0.2000 0.2000]
              RowGroupMarker: []
           ColumnGroupMarker: []
              ShowDendrogram: 'on'
                ColumnLabels: {' 9.5'  '   0'  '11.5'  '13.5'  '15.5'  '20.5'  '18.5'}
                   RowLabels: {30x1 cell}
          ColumnLabelsRotate: 90
             RowLabelsRotate: 0
        ColumnLabelsLocation: 'bottom'
           RowLabelsLocation: 'right'
                 Standardize: 'ROW'
                   Symmetric: 1
                DisplayRange: 3
                    Colormap: [11x3 double]
                   ImputeFun: []
                    Annotate: 'off'
              AnnotPrecision: 2
                  AnnotColor: 'w'
           ColumnLabelsColor: []
              RowLabelsColor: []
           LabelsWithMarkers: 0

  9. Change the clustering parameters by changing the linkage method and changing the color of the groups of nodes in the dendrogram whose linkage is less than a threshold of 3.

    set(cgo,'Linkage','complete','Dendrogram',3)

  10. Place the cursor on a branch node in the dendrogram to highlight (in blue) the group associated with it. Press and hold the mouse button to display a data tip listing the group number and the nodes (genes or samples) in the group.

  11. Right-click a branch node in the dendrogram to display a menu of options.

    The following options are available:

    • Set Group Color — Change the cluster group color.

    • Print Group to Figure — Print the group to a Figure window.

    • Copy Group to New Clustergram — Copy the group to a new Clustergram window.

    • Export Group to Workspace — Create a clustergram object of the group in the MATLAB Workspace.

    • Export Group Info to Workspace — Create a structure containing information about the group in the MATLAB Workspace. The structure contains these fields:

      • GroupNames — Cell array of text strings containing the names of the row or column groups.

      • RowNodeNames — Cell array of text strings containing the names of the row nodes.

      • ColumnNodeNames — Cell array of text strings containing the names of the column nodes.

      • ExprValues — An M-by-N matrix of intensity values, where M and N are the numbers of row nodes and column nodes, respectively. If the matrix contains gene expression data, typically each row corresponds to a gene and each column corresponds to a sample.

  12. Create a clustergram object in the MATLAB Workspace of Group 18 by right-clicking it, then selecting Export Group to Workspace. In the Export to Workspace dialog box, type Group18, then click OK.

  13. Use the get method to display the properties of the clustergram object, Group18.

    get(Group18)
    
                     Cluster: 'ALL'
                    RowPDist: {'Euclidean'}
                 ColumnPDist: {'Euclidean'}
                     Linkage: 'complete'
                  Dendrogram: 3
            OptimalLeafOrder: 1
                    LogTrans: 0
                DisplayRatio: [0.2000 0.2000]
              RowGroupMarker: []
           ColumnGroupMarker: []
              ShowDendrogram: 'on'
                ColumnLabels: {' 9.5'  '   0'  '11.5'  '13.5'  '15.5'  '20.5'  '18.5'}
                   RowLabels: {3x1 cell}
          ColumnLabelsRotate: 90
             RowLabelsRotate: 0
        ColumnLabelsLocation: 'bottom'
           RowLabelsLocation: 'right'
                 Standardize: 'ROW'
                   Symmetric: 1
                DisplayRange: 3
                    Colormap: [11x3 double]
                   ImputeFun: []
                    Annotate: 'off'
              AnnotPrecision: 2
                  AnnotColor: 'w'
           ColumnLabelsColor: []
              RowLabelsColor: []
           LabelsWithMarkers: 0
    
  14. Use the view method to view the clustergram (dendrograms and heat map) of the clustergram object, Group18.

    view(Group18)

  15. View all the gene expression data using a diverging red and blue colormap and standardize along the rows of data.

    cgo_all = clustergram(yeastvalues,'Colormap',redbluecmap,'Standardize','Row')
    Clustergram object with 614 rows of nodes and 7 columns of nodes.

  16. Create structure arrays to specify marker colors and annotations for two groups of rows (510 and 593) and two groups of columns (4 and 5).

    rm = struct('GroupNumber',{510,593},'Annotation',{'A','B'},...
         'Color',{'b','m'});
    cm = struct('GroupNumber',{4,5},'Annotation',{'Time1','Time2'},...
         'Color',{[1 1 0],[0.6 0.6 1]});

  17. Use the 'RowGroupMarker' and 'ColumnGroupMarker' properties to add the color markers and annotations to the clustergram.

    set(cgo_all,'RowGroupMarker',rm,'ColumnGroupMarker',cm)

  18. Click the color column markers to display the annotations.

References

[1] Bar-Joseph, Z., Gifford, D.K., and Jaakkola, T.S. (2001). Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, Suppl 1, S22–S29. PMID: 11472989.

[2] Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–8.

[3] DeRisi, J.L., Iyer, V.R., and Brown, P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686.

[4] Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (15), 531–537.
