MATLAB Examples

Contents

function t_test_boxplot(filename_healthy,filename_disease,yaxis_label)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Name - t_test_boxplot
% Creation Date - 27th July 2014
% Author: Soumya Banerjee
% Website: https://sites.google.com/site/neelsoumya/
%
% Description:
%  Takes measurements of genes/compounds/bugs/metabolites etc in healthy
%  (control) individuals and individuals with some disease. Then does a
%  t-test to find out how many of these genes/compounds/metabolites/bugs,
%  etc are differentially abundnant in healthy vs. disease in a
%  statistically significant manner.
%  Also generates box plots for each of the measured genes/compounds/bugs/metabolites
%  The data for healthy individuals and individuals with some disease must be supplied in
%  separate files. The rows will represent different measured genes/compounds/bugs/metabolites
%  and the columns will be different individuals.
%
% Input:
%       1) filename_healthy: Name of file containing matrix of measurements
%           of some quantity (genes/metabolites/bugs/compounds, etc) in healthy (control) individuals.
%           The columns are replicates or different individuals, that is (say) the same compound
%           measured in multiple healthy individuals.
%           The rows are different compounds.
%           There may be a header row that has some descriptive text (not
%           necessary).
%           It is necessary to have the first column of this file be a text
%           only column that has some description of the quantity being
%           measured. For example, if a compound is being measured in
%           multiple individuals, then the name of that compound should be
%           in the first column. If this is all too complicated, dont worry
%           (!), because two example files are attached.
%       2) filename_disease: Name of file containing matrix of measurements
%           of some quantity (genes/metabolites/bugs/compounds, etc) in individuals with some disease.
%           Same restrictions as above apply.
%       3) yaxis_label:   label for y-axis of boxplots (for example compound abundance)
%
% Output:
%       1) A file containing results from a t-test comparing each row in the
%           file filename_healthy to each row in the file filename_disease.
%           This includes results for an unpaired two-sample, two-tailed
%           t-test (t-score and p-value) and a Benjamini-Hochberg corrected
%           FDR (False Discovery Rate)
%       2) The number of compounds/genes/metabolites etc (represented by the
%           rows) that are differentially abundant in healthy vs. disease
%           (statistically significant at p-value < 0.05 and also FDR q-value <
%           0.05)
%       3) A number boxplot of each of the measured quantities (say compound abundance) in disease vs. healthy
%       4) A combined boxplot that combines all boxplots in 3) into one
%           (requires ghostscript to be installed and needs to be run on
%           UNIX/Mac OS X). Currently commented out.
%
% Example usage:
%       t_test_boxplot('combined_healthy_compound_preds.txt','combined_disease_compound_preds.txt','Compound abundance')
%
% License - BSD
%
% Change History -
%                   27th July 2014  - Creation by Soumya Banerjee
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Load the data

data_ptr_healthy = importdata(filename_healthy);
data_ptr_disease = importdata(filename_disease);

Initial checks

number of rows in healthy file must be the same as the number of rows in disease file. This reflects the fact that both must have measured the same number of quantities (genes, compounds, bugs, metabolites, etc)

if size(data_ptr_healthy.data,1) ~= size(data_ptr_disease.data,1)
    disp('The number of rows in healthy file must be the same as the number of rows in')
    disp('disease file. This reflects the fact that both must have measured the same number of quantities (genes, compounds, bugs, metabolites, etc)')
    return
end

iNumRows = size(data_ptr_healthy.data,1);

Perform an unpaired two-sample two-tailed t-test

[PValues tscore] = mattest(data_ptr_healthy.data,data_ptr_disease.data,'showplot',true,'showhist',true);
% other options to mattest that can be used (like permutation etc)
% [PValues tscore] = mattest(data_ptr_healthy.data,data_ptr_disease.data,'VarType','unequal','permute',true,'showplot',true,'showhist',true)

Perform a Benjamini-Hochberg correction of p-values to yield FDR (False Discovery Rate)

FDR = mafdr(PValues,'BHFDR',true);

Generate a volcano plot (optional: commented out)

%mavolcanoplot(data_ptr_healthy.data, data_ptr_disease.data,PValues)

Calculate number of measurements (compounds, metabolites, bugs, etc)

that are differentially abundant (statistically significant) in healthy vs. disease

temp_index_sign_FDR = find(FDR < 0.05);
disp('Number of measurements (compounds, metabolites, bugs, etc) that are differentially abundant (statistically significant) in healthy vs. disease at FDR < 0.05')
size(temp_index_sign_FDR,1)

temp_index_sign_pvalue = find(PValues < 0.05);
disp('Number of measurements (compounds, metabolites, bugs, etc) that are differentially abundant (statistically significant) in healthy vs. disease at p-value < 0.05')
size(temp_index_sign_pvalue,1)
Number of measurements (compounds, metabolites, bugs, etc) that are differentially abundant (statistically significant) in healthy vs. disease at FDR < 0.05

ans =

     0

Number of measurements (compounds, metabolites, bugs, etc) that are differentially abundant (statistically significant) in healthy vs. disease at p-value < 0.05

ans =

     5

Save list of p-values, t-score and FDR values to disk

dlmwrite('vital_stats.txt',[tscore PValues FDR ],'delimiter','\t')
disp('Statistics saved in file vital_stats.txt')
Statistics saved in file vital_stats.txt

Create box plot

% Iterate through all rows
for iCount = 1:iNumRows

Combined data for healthy and disease

    data_box_plot = [ data_ptr_healthy.data(iCount,:)' ; data_ptr_disease.data(iCount,:)'];

Replicates labels

    str_group = [repmat('Healthy',size(data_ptr_healthy.data(iCount,:),2),1); repmat('Disease',size(data_ptr_disease.data(iCount,:),2),1)];

create figure for boxplot

    temp_figID = figure;
    boxplot(data_box_plot,str_group,'notch','on')
    title(data_ptr_healthy.textdata(iCount))
    ylabel(yaxis_label)

save boxplot to disk

    print(temp_figID, '-dpdf', sprintf('boxplot_%d_%s.pdf', iCount, date));
end
disp('Box plots saved in file names boxplot_*.pdf')
Box plots saved in file names boxplot_*.pdf

Combine pdfs of all boxplots into one single pdf

CAUTION - requires ghostscript and a UNIX/Mac OS X system

Commented out. Uncomment if you meet both of these requirements This code courtesy of Decio Biavati at http://decio.blogspot.de/2009/01/join-merge-pdf-files-in-linux.html

%unix('gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf *.pdf')