MATLAB Examples

Annotating BLAST Reports with Taxonomy Information

This demo illustrates a simple approach to provide taxonomy annotation of BLAST hits. It requires the taxonomy information included in the files gi_taxid_prot.dmp and nodes.dmp, which can be downloaded from the NCBI taxonomy FTP site.

Contents

Introduction

The NCBI Taxonomy database is a curated set of taxonomic classifications for all the organisms that are represented in GenBank. Each taxon in the database is associated with a numerical unique identifier called taxid. The taxonomic classifications for all genomic and proteomic sequences are stored in several files, which are updated weekly since new sequences are continously submitted to GenBank and taxonomic information is refined. In the present analysis, a test dataset of short genomic reads is compared against a database of known sequences using BLAST. The taxonomic information of these known sequence database is then used to determine the taxonomic content of the test dataset by comparison, in virtue of sequence similarity.

Reading BLASTX Hit Report

The test dataset consists of 100 sequences (reads) randomly extracted from the E.coli K12 genome, with a length of approximately 100 bases. The BLAST search against the NCBI-NR database using the blastx program and a tabular formatting option produced the report saved into the file named ecoli_f100_100BLAST_mod.rpt.

reportFilename = 'ecoli_f100_100BLAST_mod.rpt';
fid = fopen(reportFilename, 'rt');

%=== read all strings to be able to write into xls
blastInfo = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s');
fclose(fid);

%=== extract relevant hit information
queries = blastInfo{1};
hits = blastInfo{2};
evalue = str2double(blastInfo{13});
score = str2double(blastInfo{14});

Filtering BLAST Hits

Since not all reported hits are statistically significant nor relevant, we consider only hits with a score higher than 40 bits and an e-value lower than e-05. Other criteria can be used, depending on the type of analysis under consideration.

%=== filter criteria
scoreThreshold = 40;
evalueThreshold = 10^-5;

%=== consider only hits with score > scoreThreshold and evalue < evalueThreshold
k = find(score > scoreThreshold & evalue < evalueThreshold);
queries = queries(k);
hits = hits(k);
evalue = evalue(k);
score = score(k);
numEntries = numel(queries);
clear blastInfo

Mapping the Taxonomy Data File

In order to retrieve the taxonomic information in a quick and efficient way, we have created a map between any possible gi number in the GenBank database and its associated taxid. As of July 2007, there are more than 100 million live gi numbers and the memory requirements for loading such a large dataset can be very demanding. Thus, we read the data in blocks of 1MB, save it as binary file and then use the MATLAB function memmapfile to map into memory the content of the file itself, so that the data can be access using standard indexing operations. See the documentation for memmapfile for more details.

type mapTaxoFile

taxonomyFilenameIn = 'gi_taxid_prot.dmp';
taxonomyFilenameOut = 'gi_taxid_prot_map.dmp';

%=== preprocessing info so that gi --> taxid (taxid = -1 if invalid gi)
blockSize = 2^20; % block size (1MB)
mapTaxoFile(taxonomyFilenameIn, taxonomyFilenameOut, blockSize);

%=== map the file into memory
mt = memmapfile(taxonomyFilenameOut, 'format', 'int32');
function mapTaxoFile(taxonomyFilenameIn, taxonomyFilenameOut, blockSize)

if exist(taxonomyFilenameOut,'file')
    taxonomyFilenameInInfo = dir(taxonomyFilenameIn);
    taxonomyFilenameOutInfo = dir(taxonomyFilenameOut);
    if taxonomyFilenameOutInfo.datenum > taxonomyFilenameInInfo.datenum
       warning('mapTaxoFile:FileExists','Memory mapped file exists.')
       return
    end
end

fid1 = fopen(taxonomyFilenameIn,'rt'); % from NCBI TAXONOMY FTP site
fid2 = fopen(taxonomyFilenameOut, 'w');% binary file used for mapping

curr = 1; % current gi to consider

while(~feof(fid1))
    
    data = textscan(fid1, '%d %d', blockSize);
    gi = data{1};
    taxa = data{2};
    gap = gi(1) - curr; 
    
    %=== missing gi numbers between blocks are assigned a taxid = -1
    if gap
        D = -1 * ones(gap, 1); 
        fwrite(fid2, D, 'int32');
    end
        
    %=== populate array D such that D(gi) = taxid of gi
    curr = gi(end) + 1;   % current gi position in the final list
    offset = min(gi) - 1; % starting gi in the current block
    N = max(gi) - offset; % number of gi's  to consider
    D = -1 * ones(N,1);   
    D(gi - offset) = taxa;
       
    %=== write array D into binary file
    fwrite(fid2, D, 'int32');
end
fclose all;
    

Classifying BLAST Hits by Taxid

Once the correspondence between gi numbers and taxids is created, we annotate each BLAST hit in the report by retrieving the taxid associated with the subject sequence, thus providing a taxonomy annotation of query sequences by comparison.

%=== extract gi number for each hit
gi = zeros(1, numEntries);
for i = 1:numEntries
    g = str2double(regexpi(hits{i}, '(?<=gi\|)\d+', 'match', 'once'));
    if ~isempty(g)
        gi(i) = g;
    end
end

%=== determine taxid for each gi number hit
taxid = mt.Data(gi);

%=== e.g., taxid of the 8th hit in the report
taxid(8)
ans =

       83333

Classifying BLAST Hits by Scientific Name

Similarly, we can annotate each BLAST hit in the report using the scientific names, rather than the taxids.

%=== read taxonomy node file
taxonomyFilenameIn = 'names.dmp';
fid1 = fopen(taxonomyFilenameIn,'rt');
nameInfo = textscan(fid1, '%d%s%s%s', 'delimiter', '|');
fclose(fid1);

%=== preallocate space for SN
maxTaxid = max(double(nameInfo{1}));
SN = repmat({''}, maxTaxid, 1);

%=== populate array so that taxid --> scientific name
ind = strmatch('scientific', nameInfo{4}); % indices of scientific names in the array
SN(nameInfo{1}(ind)) = nameInfo{2}(ind);

%=== assign name to every hit
sciNames = SN(taxid);

%=== e.g., scientific name of the 8th hit in the report
strtrim(sciNames(8))
ans = 

    'Escherichia coli K12'

Determining the Taxonomic Distribution of BLAST Hits

One interesting aspect of classifying the sequences in a BLAST report is to study their taxonomic distribution. Since we are not particularly interested in those species that are hit only once in the BLAST search, we filter out the isolated assignments. We then create a list of the species represented, their taxids and their frequency.

%=== distribution by taxid
taxidList = unique(taxid);       % list of unique taxids
T = accumarray(taxid, 1);        % multiplicity of taxids
taxidCount = T(unique(taxid));   % number of hits for each taxon
numTaxa = length(taxidCount)

%=== filter isolated assignments
t1 = taxidCount ~= 1;
taxidListFiltered = taxidList(t1);
taxidCountFiltered = taxidCount(t1);
numTaxaFiltered = length(taxidCountFiltered)

%=== consider the scientific name of distict taxa represented in the report
taxidListNames = {SN{taxidList}};
taxidListNameFiltered = {SN{taxidListFiltered}};

%===plot
figure(); bar(taxidCount);
xlabel('Distinct taxonomic assignments');
ylabel('Number of hits');
title('Taxonomic distribution of hits in BLAST report');

%=== replot by sorting the counts
figure(); bar(sort(taxidCountFiltered));
xlabel('Distinct taxonomic assignments');
ylabel('Number of hits');
title('Taxonomic distribution of sorted, filtered hits in BLAST report');
numTaxa =

   279


numTaxaFiltered =

   156

For instance, we can look at the 5 most represented species in the report, or we can determine how many distinct species are hit by the current BLAST search.

%=== number of distinct taxa represented in the report
numTaxa = length(taxidList)

%=== assign name
[a,b] = sort(taxidCount, 'descend');

%=== consider the two most represented species and their frequency
twoMostRepr = strtrim(taxidListNames(b(1:2))')
twoMostReprFreq = a(1:2) / sum(a)
numTaxa =

   279


twoMostRepr = 

    'Escherichia coli'
    'Escherichia coli O157:H7 EDL933'


twoMostReprFreq =

    0.0437
    0.0302

Limiting the Analysis to Best-hit for Each Query

We can repeat the above procedure by limiting the analysis only to the best-scoring hit for each query sequence.

[a, b] = unique(queries, 'first'); % best hits rows
bestHitTaxid = taxid(b);
bestHitSciName = SN(b);

T = accumarray(bestHitTaxid, 1);        % multiplicity of taxids
bestCount = T(unique(bestHitTaxid));   % number of hits for each taxon
bestCountNames = SN(unique(bestHitTaxid));

%=== plot
[a,b] = sort(bestCount);
figure(); pie(a);
legend(bestCountNames(b), 'fontSize', 8, 'location', 'bestoutside')
set(gcf, 'Position', [531 678 800 420])
title('Taxonomic distribution at species level for best-scoring hits')

Memory-Mapping the Taxon Node Information

Oftentimes the taxonomic annotation ought to be done at a level higher than the species level. Information about the hierarchy within the NCBI Taxonomy Database is included in the file nodes.dmp. Similarly to the mapping created for the gi numbers and taxids, we create a correspondence between taxids and rank in the tree, and also between taxids and its parent node in the Taxonomy Database.

type mapNodeFile

fnameIn = 'nodes.dmp';
mmfname1= 'nodes_parent_map.dmp';
mmfname2= 'nodes_rank_map.dmp';

%=== preprocessing info so that taxid --> taxid_parent and taxid --> rank
blockSize = 2^20; % block size (1MB)
mapNodeFile(fnameIn, mmfname1, mmfname2, blockSize);

%=== map the files into memory
mm1 = memmapfile(mmfname1, 'format', 'int32'); % taxid --> taxid_parent
mm2 = memmapfile(mmfname2, 'format', 'int32'); % taxid --> rank
function mapNodeFile(nodeFilenameIn, nodeFilenameOut1, nodeFilenameOut2, blockSize)

if exist(nodeFilenameOut1,'file') && exist(nodeFilenameOut2,'file')
    nodeFilenameInInfo = dir(nodeFilenameIn);
    nodeFilenameOut1Info = dir(nodeFilenameOut1);
    nodeFilenameOut2Info = dir(nodeFilenameOut2);
    if nodeFilenameOut1Info.datenum > nodeFilenameInInfo.datenum && ...
       nodeFilenameOut2Info.datenum > nodeFilenameInInfo.datenum         
       warning('mapNodeFile:FileExists','Memory mapped file exists.')
       return
    end
end


fid1 = fopen(nodeFilenameIn,'rt'); % from NCBI TAXONOMY FTP site
fid2 = fopen(nodeFilenameOut1, 'w');% binary file used for mapping
fid3 = fopen(nodeFilenameOut2, 'w');% binary file used for mapping
curr = 1; % current taxid to consider
rank = {'superkingdom', 'kingdom', 'phylum', 'subphylum', 'class', 'subclass', 'order', 'family', 'genus', 'species', 'no rank'};
code = {1,2,3,4,5,6,7,8,9,10,11};

while(~feof(fid1))
 
  
    %=== node with rank outside the rank list above are assigned a rank =
    %11 (no rank)
    
    data = textscan(fid1, '%d%d%s%s%d%d%d%d%d%d%d%d%s', blockSize, 'delimiter', '|'); 
    node = data{1};
    parent = data{2};
    classif = strtrim(data{3});
    classifInt = ones(numel(classif),1) * 11;
    for i = 1:numel(rank)
        K = strmatch(rank{i}, classif); % entries that are classified with the specific rank
        classifInt(K) = code{i}; % assign rank code
    end
    gap = node(1) - curr; 
    
      %=== missing node numbers between blocks are assigned a rank = -1
    if gap
        P = -1 * ones(gap, 1); % parent
        R = -1 * ones(gap, 1); % rank
        fwrite(fid2, P, 'int32');
        fwrite(fid3, R, 'int32');
    end
        
    %=== populate array P such that P(node) = parent of node
    %=== populate array R such that R(node) = rank of node (integer)
    curr = node(end) + 1;   % current gi position in the final list
    offset = min(node) - 1; % starting gi in the current block
    N = max(node) - offset; % number of gi's  to consider
    
    %=== if not found, the parent is the root and the rank is 1
    P = ones(N,1);   
    P(node - offset) = parent;
    R = ones(N,1);
    R(node - offset) = classifInt;
          
    %=== write array D into binary file
    fwrite(fid2, P, 'int32');
    fwrite(fid3, R, 'int32');
end
fclose all;
    


Classifying BLAST Hits by Higher Taxonomic Rank

Each hit in the BLAST report can be annotated at higher levels by travelling the taxonomy classification tree up to the rank of interest (e.g. family). Taxonomy nodes whose rank is not included in the given list of interest are considered as 'no rank' nodes.

rTarget = 8; % target rank
rTargetString = 'family';

propTaxid = taxid; % taxid of propagated hits
c = [];            % classified query numbers
class = [];        % highest rank taxid of classified queries
queryNum = (1: numEntries)';

%=== remove hits classified above the target
rHits = mm2.Data(propTaxid);
n = find(rHits < rTarget);
queryNum(n) = [];
propTaxid(n) = [];
rHits(n) = [];

while ~isempty(propTaxid)

    %=== break if we have reached the root
    if all(propTaxid == 1)
        c = [c; queryNum];
        class = [class; propTaxid];
        break;
    end

    %=== identify rank of propagated hits
    propRank = mm2.Data(propTaxid);

    %=== determine hits for which classification is done
    y = find(propRank == rTarget | propRank == 1);
    c = [c; queryNum(y)];
    class = [class; propTaxid(y)];

    %=== remove classified from set to propagate
    queryNum(y) =[];
    propTaxid(y) = [];
    propRank(y) = [];

    %=== propagate remaining nodes
    propTaxid = mm1.Data(propTaxid);
end

Determining the Taxonomic Distribution of BLAST Hits at higher rank level

Once the taxonomic annotation at higher rank level has been performed, we can analyze the taxonomic distribution as following.

%=== distribution by family (or higher rank)
classTaxidList = unique(class);
T = accumarray(class, 1);
classCount = T(unique(class));
classNameList = SN(classTaxidList)

%=== simple statistics of the hit (family or higher rank) distribution
numTaxa = length(classTaxidList) % number of distinct taxa represented in the report

[maxCount,maxInd] = max(classCount) % maximum number of hit for a single family (or higher rank)
maxTaxid = taxidList(maxInd) % taxid of the most hit family (or higher rank)
maxTaxidName = classNameList{maxInd}

maxClassFract = maxCount/numEntries % fraction of hits belonging to the most represented family

%=== filter out isolated assignments
t2 = classCount ~= 1;
classTaxidListFiltered = classTaxidList(t2);
classCountFiltered = classCount(t2);

classNameListFiltered = {SN{classTaxidListFiltered}};

%=== plot
figure(); bar(classCountFiltered);
xlabel('Taxa');
ylabel('Number of hits');
title(['Taxonomic distribution of filtered hits in BLAST report by ' rTargetString]);

%=== replot sorted counts
figure(); bar(sort(classCountFiltered));
xlabel(['Distinct ' rTargetString ' assignments']);
ylabel('Number of hits');
title(['Taxonomic distribution of filtered hits in BLAST report by ' rTargetString]);

%=== annotate plot
[a,b]= sort(classCountFiltered, 'descend');
annotation('textarrow', [.68,.78], [.8,.8], 'string', classNameListFiltered(b(1)), 'color', 'red');
annotation('textarrow', [.65,.75], [.2,.2], 'string', classNameListFiltered(b(2)), 'color', 'blue');
classNameList = 

    'root	'
    'Bacteria	'
    'Moraxellaceae	'
    'Neisseriaceae	'
    'Alcaligenaceae	'
    'Enterobacteriaceae	'
    'Vibrionaceae	'
    'Pasteurellaceae	'
    'Halobacteriaceae	'
    'Apidae	'
    'Halomonadaceae	'
    'Clostridiaceae	'
    'Rhodobacteraceae	'
    'Methylophilaceae	'
    'Francisellaceae	'
    'Rhodospirillaceae	'
    'Sphingomonadaceae	'
    'Phyllobacteriaceae	'
    'Hyphomonadaceae	'
    'Alteromonadaceae	'
    'Campylobacteraceae	'
    'Oxalobacteraceae	'
    'Rhodocyclaceae	'
    'Caulobacteraceae	'
    'Comamonadaceae	'
    'Enterococcaceae	'
    'Rhizobiaceae	'
    'Aeromonadaceae	'
    'Nocardiaceae	'
    'Coxiellaceae	'
    'Methylobacteriaceae	'
    'Burkholderiaceae	'
    'Oceanospirillaceae	'
    'Pseudomonadaceae	'
    'Porphyromonadaceae	'
    'Peptococcaceae	'
    'Listeriaceae	'
    'Desulfovibrionaceae	'
    'Desulfobulbaceae	'
    'Geobacteraceae	'
    'Pelobacteraceae	'
    'Alcanivoracaceae	'
    'Hahellaceae	'
    'Aurantimonadaceae	'
    'Pseudoalteromonadaceae	'
    'Colwelliaceae	'
    'Shewanellaceae	'
    'Moritellaceae	'
    'Idiomarinaceae	'
    'Psychromonadaceae	'
    'Xanthobacteraceae	'


numTaxa =

    51


maxCount =

   739


maxInd =

     6


maxTaxid =

         552


maxTaxidName =

Enterobacteriaceae	


maxClassFract =

    0.6205

Saving Annotated BLAST report

Once the taxonomic classification for each hit has been determined, the information can be saved for further analysis in a MATLAB file, or appended to the BLAST report in text format, or even exported into an Excel file, if such application is available in your system.

%=== save into MATLAB file
save ecoli_f100_BLAST100_annotated.mat queries hits evalue score taxid sciNames

%=== create a text files
fid = fopen('ecoli_f100_BLAST100_annotated.txt', 'wt');
for i = 1:length(taxid)
    fprintf(fid, '%s %s %d %d %d %s\n', queries{i}, hits{i}, score(i), evalue(i), taxid(i), sciNames{i});
end
fclose(fid);

%=== export info into EXCEL file
xlsFilename = 'ecoli_f100_BLAST100_annotated.xls';
taxoStrings = cellfun(@num2str, num2cell(taxid), 'UniformOutput', false);
evalueStrings = cellfun(@num2str, num2cell(evalue), 'UniformOutput', false);
scoreStrings = cellfun(@num2str, num2cell(score), 'UniformOutput', false);
xlswrite(xlsFilename, [queries evalueStrings scoreStrings taxoStrings sciNames]);