Annotating BLAST Reports with Taxonomy Information
This demo illustrates a simple approach to provide taxonomy annotation of BLAST hits. It requires the taxonomy information included in the files gi_taxid_prot.dmp and nodes.dmp, which can be downloaded from the NCBI taxonomy FTP site.
Contents
- Introduction
- Reading BLASTX Hit Report
- Filtering BLAST Hits
- Mapping the Taxonomy Data File
- Classifying BLAST Hits by Taxid
- Classifying BLAST Hits by Scientific Name
- Determining the Taxonomic Distribution of BLAST Hits
- Limiting the Analysis to Best-hit for Each Query
- Memory-Mapping the Taxon Node Information
- Classifying BLAST Hits by Higher Taxonomic Rank
- Determining the Taxonomic Distribution of BLAST Hits at higher rank level
- Saving Annotated BLAST report
Introduction
The NCBI Taxonomy database is a curated set of taxonomic classifications for all the organisms that are represented in GenBank. Each taxon in the database is associated with a numerical unique identifier called taxid. The taxonomic classifications for all genomic and proteomic sequences are stored in several files, which are updated weekly since new sequences are continously submitted to GenBank and taxonomic information is refined. In the present analysis, a test dataset of short genomic reads is compared against a database of known sequences using BLAST. The taxonomic information of these known sequence database is then used to determine the taxonomic content of the test dataset by comparison, in virtue of sequence similarity.
Reading BLASTX Hit Report
The test dataset consists of 100 sequences (reads) randomly extracted from the E.coli K12 genome, with a length of approximately 100 bases. The BLAST search against the NCBI-NR database using the blastx program and a tabular formatting option produced the report saved into the file named ecoli_f100_100BLAST_mod.rpt.
reportFilename = 'ecoli_f100_100BLAST_mod.rpt'; fid = fopen(reportFilename, 'rt'); %=== read all strings to be able to write into xls blastInfo = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s'); fclose(fid); %=== extract relevant hit information queries = blastInfo{1}; hits = blastInfo{2}; evalue = str2double(blastInfo{13}); score = str2double(blastInfo{14});
Filtering BLAST Hits
Since not all reported hits are statistically significant nor relevant, we consider only hits with a score higher than 40 bits and an e-value lower than e-05. Other criteria can be used, depending on the type of analysis under consideration.
%=== filter criteria scoreThreshold = 40; evalueThreshold = 10^-5; %=== consider only hits with score > scoreThreshold and evalue < evalueThreshold k = find(score > scoreThreshold & evalue < evalueThreshold); queries = queries(k); hits = hits(k); evalue = evalue(k); score = score(k); numEntries = numel(queries); clear blastInfo
Mapping the Taxonomy Data File
In order to retrieve the taxonomic information in a quick and efficient way, we have created a map between any possible gi number in the GenBank database and its associated taxid. As of July 2007, there are more than 100 million live gi numbers and the memory requirements for loading such a large dataset can be very demanding. Thus, we read the data in blocks of 1MB, save it as binary file and then use the MATLAB function memmapfile to map into memory the content of the file itself, so that the data can be access using standard indexing operations. See the documentation for memmapfile for more details.
type mapTaxoFile taxonomyFilenameIn = 'gi_taxid_prot.dmp'; taxonomyFilenameOut = 'gi_taxid_prot_map.dmp'; %=== preprocessing info so that gi --> taxid (taxid = -1 if invalid gi) blockSize = 2^20; % block size (1MB) mapTaxoFile(taxonomyFilenameIn, taxonomyFilenameOut, blockSize); %=== map the file into memory mt = memmapfile(taxonomyFilenameOut, 'format', 'int32');
function mapTaxoFile(taxonomyFilenameIn, taxonomyFilenameOut, blockSize)
if exist(taxonomyFilenameOut,'file')
taxonomyFilenameInInfo = dir(taxonomyFilenameIn);
taxonomyFilenameOutInfo = dir(taxonomyFilenameOut);
if taxonomyFilenameOutInfo.datenum > taxonomyFilenameInInfo.datenum
warning('mapTaxoFile:FileExists','Memory mapped file exists.')
return
end
end
fid1 = fopen(taxonomyFilenameIn,'rt'); % from NCBI TAXONOMY FTP site
fid2 = fopen(taxonomyFilenameOut, 'w');% binary file used for mapping
curr = 1; % current gi to consider
while(~feof(fid1))
data = textscan(fid1, '%d %d', blockSize);
gi = data{1};
taxa = data{2};
gap = gi(1) - curr;
%=== missing gi numbers between blocks are assigned a taxid = -1
if gap
D = -1 * ones(gap, 1);
fwrite(fid2, D, 'int32');
end
%=== populate array D such that D(gi) = taxid of gi
curr = gi(end) + 1; % current gi position in the final list
offset = min(gi) - 1; % starting gi in the current block
N = max(gi) - offset; % number of gi's to consider
D = -1 * ones(N,1);
D(gi - offset) = taxa;
%=== write array D into binary file
fwrite(fid2, D, 'int32');
end
fclose all;
Classifying BLAST Hits by Taxid
Once the correspondence between gi numbers and taxids is created, we annotate each BLAST hit in the report by retrieving the taxid associated with the subject sequence, thus providing a taxonomy annotation of query sequences by comparison.
%=== extract gi number for each hit gi = zeros(1, numEntries); for i = 1:numEntries g = str2double(regexpi(hits{i}, '(?<=gi\|)\d+', 'match', 'once')); if ~isempty(g) gi(i) = g; end end %=== determine taxid for each gi number hit taxid = mt.Data(gi); %=== e.g., taxid of the 8th hit in the report taxid(8)
ans =
83333
Classifying BLAST Hits by Scientific Name
Similarly, we can annotate each BLAST hit in the report using the scientific names, rather than the taxids.
%=== read taxonomy node file taxonomyFilenameIn = 'names.dmp'; fid1 = fopen(taxonomyFilenameIn,'rt'); nameInfo = textscan(fid1, '%d%s%s%s', 'delimiter', '|'); fclose(fid1); %=== preallocate space for SN maxTaxid = max(double(nameInfo{1})); SN = repmat({''}, maxTaxid, 1); %=== populate array so that taxid --> scientific name ind = strmatch('scientific', nameInfo{4}); % indices of scientific names in the array SN(nameInfo{1}(ind)) = nameInfo{2}(ind); %=== assign name to every hit sciNames = SN(taxid); %=== e.g., scientific name of the 8th hit in the report strtrim(sciNames(8))
ans =
'Escherichia coli K12'
Determining the Taxonomic Distribution of BLAST Hits
One interesting aspect of classifying the sequences in a BLAST report is to study their taxonomic distribution. Since we are not particularly interested in those species that are hit only once in the BLAST search, we filter out the isolated assignments. We then create a list of the species represented, their taxids and their frequency.
%=== distribution by taxid taxidList = unique(taxid); % list of unique taxids T = accumarray(taxid, 1); % multiplicity of taxids taxidCount = T(unique(taxid)); % number of hits for each taxon numTaxa = length(taxidCount) %=== filter isolated assignments t1 = taxidCount ~= 1; taxidListFiltered = taxidList(t1); taxidCountFiltered = taxidCount(t1); numTaxaFiltered = length(taxidCountFiltered) %=== consider the scientific name of distict taxa represented in the report taxidListNames = {SN{taxidList}}; taxidListNameFiltered = {SN{taxidListFiltered}}; %===plot figure(); bar(taxidCount); xlabel('Distinct taxonomic assignments'); ylabel('Number of hits'); title('Taxonomic distribution of hits in BLAST report'); %=== replot by sorting the counts figure(); bar(sort(taxidCountFiltered)); xlabel('Distinct taxonomic assignments'); ylabel('Number of hits'); title('Taxonomic distribution of sorted, filtered hits in BLAST report');
numTaxa = 279 numTaxaFiltered = 156
For instance, we can look at the 5 most represented species in the report, or we can determine how many distinct species are hit by the current BLAST search.
%=== number of distinct taxa represented in the report numTaxa = length(taxidList) %=== assign name [a,b] = sort(taxidCount, 'descend'); %=== consider the two most represented species and their frequency twoMostRepr = strtrim(taxidListNames(b(1:2))') twoMostReprFreq = a(1:2) / sum(a)
numTaxa =
279
twoMostRepr =
'Escherichia coli'
'Escherichia coli O157:H7 EDL933'
twoMostReprFreq =
0.0437
0.0302
Limiting the Analysis to Best-hit for Each Query
We can repeat the above procedure by limiting the analysis only to the best-scoring hit for each query sequence.
[a, b] = unique(queries, 'first'); % best hits rows bestHitTaxid = taxid(b); bestHitSciName = SN(b); T = accumarray(bestHitTaxid, 1); % multiplicity of taxids bestCount = T(unique(bestHitTaxid)); % number of hits for each taxon bestCountNames = SN(unique(bestHitTaxid)); %=== plot [a,b] = sort(bestCount); figure(); pie(a); legend(bestCountNames(b), 'fontSize', 8, 'location', 'bestoutside') set(gcf, 'Position', [531 678 800 420]) title('Taxonomic distribution at species level for best-scoring hits')
Memory-Mapping the Taxon Node Information
Oftentimes the taxonomic annotation ought to be done at a level higher than the species level. Information about the hierarchy within the NCBI Taxonomy Database is included in the file nodes.dmp. Similarly to the mapping created for the gi numbers and taxids, we create a correspondence between taxids and rank in the tree, and also between taxids and its parent node in the Taxonomy Database.
type mapNodeFile fnameIn = 'nodes.dmp'; mmfname1= 'nodes_parent_map.dmp'; mmfname2= 'nodes_rank_map.dmp'; %=== preprocessing info so that taxid --> taxid_parent and taxid --> rank blockSize = 2^20; % block size (1MB) mapNodeFile(fnameIn, mmfname1, mmfname2, blockSize); %=== map the files into memory mm1 = memmapfile(mmfname1, 'format', 'int32'); % taxid --> taxid_parent mm2 = memmapfile(mmfname2, 'format', 'int32'); % taxid --> rank
function mapNodeFile(nodeFilenameIn, nodeFilenameOut1, nodeFilenameOut2, blockSize)
if exist(nodeFilenameOut1,'file') && exist(nodeFilenameOut2,'file')
nodeFilenameInInfo = dir(nodeFilenameIn);
nodeFilenameOut1Info = dir(nodeFilenameOut1);
nodeFilenameOut2Info = dir(nodeFilenameOut2);
if nodeFilenameOut1Info.datenum > nodeFilenameInInfo.datenum && ...
nodeFilenameOut2Info.datenum > nodeFilenameInInfo.datenum
warning('mapNodeFile:FileExists','Memory mapped file exists.')
return
end
end
fid1 = fopen(nodeFilenameIn,'rt'); % from NCBI TAXONOMY FTP site
fid2 = fopen(nodeFilenameOut1, 'w');% binary file used for mapping
fid3 = fopen(nodeFilenameOut2, 'w');% binary file used for mapping
curr = 1; % current taxid to consider
rank = {'superkingdom', 'kingdom', 'phylum', 'subphylum', 'class', 'subclass', 'order', 'family', 'genus', 'species', 'no rank'};
code = {1,2,3,4,5,6,7,8,9,10,11};
while(~feof(fid1))
%=== node with rank outside the rank list above are assigned a rank =
%11 (no rank)
data = textscan(fid1, '%d%d%s%s%d%d%d%d%d%d%d%d%s', blockSize, 'delimiter', '|');
node = data{1};
parent = data{2};
classif = strtrim(data{3});
classifInt = ones(numel(classif),1) * 11;
for i = 1:numel(rank)
K = strmatch(rank{i}, classif); % entries that are classified with the specific rank
classifInt(K) = code{i}; % assign rank code
end
gap = node(1) - curr;
%=== missing node numbers between blocks are assigned a rank = -1
if gap
P = -1 * ones(gap, 1); % parent
R = -1 * ones(gap, 1); % rank
fwrite(fid2, P, 'int32');
fwrite(fid3, R, 'int32');
end
%=== populate array P such that P(node) = parent of node
%=== populate array R such that R(node) = rank of node (integer)
curr = node(end) + 1; % current gi position in the final list
offset = min(node) - 1; % starting gi in the current block
N = max(node) - offset; % number of gi's to consider
%=== if not found, the parent is the root and the rank is 1
P = ones(N,1);
P(node - offset) = parent;
R = ones(N,1);
R(node - offset) = classifInt;
%=== write array D into binary file
fwrite(fid2, P, 'int32');
fwrite(fid3, R, 'int32');
end
fclose all;
Classifying BLAST Hits by Higher Taxonomic Rank
Each hit in the BLAST report can be annotated at higher levels by travelling the taxonomy classification tree up to the rank of interest (e.g. family). Taxonomy nodes whose rank is not included in the given list of interest are considered as 'no rank' nodes.
rTarget = 8; % target rank rTargetString = 'family'; propTaxid = taxid; % taxid of propagated hits c = []; % classified query numbers class = []; % highest rank taxid of classified queries queryNum = (1: numEntries)'; %=== remove hits classified above the target rHits = mm2.Data(propTaxid); n = find(rHits < rTarget); queryNum(n) = []; propTaxid(n) = []; rHits(n) = []; while ~isempty(propTaxid) %=== break if we have reached the root if all(propTaxid == 1) c = [c; queryNum]; class = [class; propTaxid]; break; end %=== identify rank of propagated hits propRank = mm2.Data(propTaxid); %=== determine hits for which classification is done y = find(propRank == rTarget | propRank == 1); c = [c; queryNum(y)]; class = [class; propTaxid(y)]; %=== remove classified from set to propagate queryNum(y) =[]; propTaxid(y) = []; propRank(y) = []; %=== propagate remaining nodes propTaxid = mm1.Data(propTaxid); end
Determining the Taxonomic Distribution of BLAST Hits at higher rank level
Once the taxonomic annotation at higher rank level has been performed, we can analyze the taxonomic distribution as following.
%=== distribution by family (or higher rank) classTaxidList = unique(class); T = accumarray(class, 1); classCount = T(unique(class)); classNameList = SN(classTaxidList) %=== simple statistics of the hit (family or higher rank) distribution numTaxa = length(classTaxidList) % number of distinct taxa represented in the report [maxCount,maxInd] = max(classCount) % maximum number of hit for a single family (or higher rank) maxTaxid = taxidList(maxInd) % taxid of the most hit family (or higher rank) maxTaxidName = classNameList{maxInd} maxClassFract = maxCount/numEntries % fraction of hits belonging to the most represented family %=== filter out isolated assignments t2 = classCount ~= 1; classTaxidListFiltered = classTaxidList(t2); classCountFiltered = classCount(t2); classNameListFiltered = {SN{classTaxidListFiltered}}; %=== plot figure(); bar(classCountFiltered); xlabel('Taxa'); ylabel('Number of hits'); title(['Taxonomic distribution of filtered hits in BLAST report by ' rTargetString]); %=== replot sorted counts figure(); bar(sort(classCountFiltered)); xlabel(['Distinct ' rTargetString ' assignments']); ylabel('Number of hits'); title(['Taxonomic distribution of filtered hits in BLAST report by ' rTargetString]); %=== annotate plot [a,b]= sort(classCountFiltered, 'descend'); annotation('textarrow', [.68,.78], [.8,.8], 'string', classNameListFiltered(b(1)), 'color', 'red'); annotation('textarrow', [.65,.75], [.2,.2], 'string', classNameListFiltered(b(2)), 'color', 'blue');
classNameList =
'root '
'Bacteria '
'Moraxellaceae '
'Neisseriaceae '
'Alcaligenaceae '
'Enterobacteriaceae '
'Vibrionaceae '
'Pasteurellaceae '
'Halobacteriaceae '
'Apidae '
'Halomonadaceae '
'Clostridiaceae '
'Rhodobacteraceae '
'Methylophilaceae '
'Francisellaceae '
'Rhodospirillaceae '
'Sphingomonadaceae '
'Phyllobacteriaceae '
'Hyphomonadaceae '
'Alteromonadaceae '
'Campylobacteraceae '
'Oxalobacteraceae '
'Rhodocyclaceae '
'Caulobacteraceae '
'Comamonadaceae '
'Enterococcaceae '
'Rhizobiaceae '
'Aeromonadaceae '
'Nocardiaceae '
'Coxiellaceae '
'Methylobacteriaceae '
'Burkholderiaceae '
'Oceanospirillaceae '
'Pseudomonadaceae '
'Porphyromonadaceae '
'Peptococcaceae '
'Listeriaceae '
'Desulfovibrionaceae '
'Desulfobulbaceae '
'Geobacteraceae '
'Pelobacteraceae '
'Alcanivoracaceae '
'Hahellaceae '
'Aurantimonadaceae '
'Pseudoalteromonadaceae '
'Colwelliaceae '
'Shewanellaceae '
'Moritellaceae '
'Idiomarinaceae '
'Psychromonadaceae '
'Xanthobacteraceae '
numTaxa =
51
maxCount =
739
maxInd =
6
maxTaxid =
552
maxTaxidName =
Enterobacteriaceae
maxClassFract =
0.6205
Saving Annotated BLAST report
Once the taxonomic classification for each hit has been determined, the information can be saved for further analysis in a MATLAB file, or appended to the BLAST report in text format, or even exported into an Excel file, if such application is available in your system.
%=== save into MATLAB file save ecoli_f100_BLAST100_annotated.mat queries hits evalue score taxid sciNames %=== create a text files fid = fopen('ecoli_f100_BLAST100_annotated.txt', 'wt'); for i = 1:length(taxid) fprintf(fid, '%s %s %d %d %d %s\n', queries{i}, hits{i}, score(i), evalue(i), taxid(i), sciNames{i}); end fclose(fid); %=== export info into EXCEL file xlsFilename = 'ecoli_f100_BLAST100_annotated.xls'; taxoStrings = cellfun(@num2str, num2cell(taxid), 'UniformOutput', false); evalueStrings = cellfun(@num2str, num2cell(evalue), 'UniformOutput', false); scoreStrings = cellfun(@num2str, num2cell(score), 'UniformOutput', false); xlswrite(xlsFilename, [queries evalueStrings scoreStrings taxoStrings sciNames]);
