Bioinformatics Toolbox 3.4
Performing a Metagenomic Analysis of a Sargasso Sea Sample
This demo illustrates a simple metagenomic analysis on a sample data set from the Sargasso Sea. It requires the taxonomy information included in the files gi_taxid_prot.dmp, names.dmp and nodes.dmp (see the compressed file taxdump), which you can download from the NCBI taxonomy FTP site.
Contents
- Introduction
- Reading BLASTX Hit Report
- Filtering BLAST Hits
- Memory-Mapping the Taxonomy Data File
- Annotating the BLAST Report with Taxonomic Information
- Classifying BLAST Hits by Scientific Name
- Saving Annotated BLAST Report
- Determining the Taxonomic Distribution of BLAST Hits
- Filtering Out Isolated Assignments
- Plotting Taxonomic Distribution of BLAST Hits
- Limiting the Analysis to the Best Hit for Each Query
- Memory-Mapping the Taxon Node Information
- Classifying BLAST Hits by Higher Taxonomic Rank
- Representing the Taxonomic Distribution on a Graph
- References
Introduction
Metagenomics is the study of the taxonomic composition of a sample of organisms obtained from a common habitat. It usually consists of the comparison of the sequence samples against databases of known sequences and the use of taxonomy information to classify the sample species. The main goals of a metagenomic analysis include the quantification of the relative abundance of known species and the identification of unknown sequences for which no relatives have yet been identified.
Reading BLASTX Hit Report
In this demo, we consider a small subset (100 reads) of the Sargasso Sea data set [1], which has been searched against the NCBI-NR database using BLASTX with default parameters. For convenience, the resulting BLAST report has been saved and compressed into the file sargasso-sample1-100.rpt.gz, and it is provided with Bioinformatics Toolbox™. We read the report content and extract relevant information such as the high-scoring pairs, their score, expectation value and percent identity.
%=== open the blastx report reportFilename = gunzip('sargasso-sample1-100.rpt.gz',tempdir); fid = fopen(reportFilename{1}, 'rt'); %=== read all strings to be able to write into xls blastInfo = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s'); fclose(fid); delete(reportFilename{1}); %=== extract relevant information queries = blastInfo{1}; hits = blastInfo{2}; ident = str2double(blastInfo{3}); evalue = str2double(blastInfo{11}); score = str2double(blastInfo{12}); numEntries = numel(queries)
numEntries =
19817
Filtering BLAST Hits
Because we are interested only in significant hits, we filter the results based on their score, expectation value and percent identity with the query sequences. By using this filtering process, we reduce the number of hits to approximately one quarter of the original hits.
%=== setup filter criteria scoreThreshold = 100; evalueThreshold = 10^-5; identThreshold = 50; %=== consider only hits satisfying the criteria k = find(score > scoreThreshold & evalue < evalueThreshold & i dent > identThreshold); queries = queries(k); hits = hits(k); evalue = evalue(k); score = score(k); numEntries = length(k) %=== clear report clear blastInfo
numEntries =
5252
Memory-Mapping the Taxonomy Data File
The taxonomic classifications for all GenBank® sequences are stored in large files that are updated weekly as new sequences are submitted and the taxonomic information is refined. To retrieve this information in a quick and efficient way, we create a map between any possible gi number in the GenBank database and its associated taxonomic identifier (taxid). Because currently there are more than 100 million live gi numbers, the memory requirements for loading such a large data set can be very demanding. Thus, using the function mapTaxoFile, we read the data in blocks of 1MB, save it as a binary file and then use the function memmapfile to map into memory the content of the file itself, so that the data can be accessed using standard indexing operations. See memmapfile help for more details.
% type mapTaxoFile taxoFilenameIn = 'gi_taxid_prot.dmp'; taxoFilenameOut = 'gi_taxid_prot_map.dmp'; %=== create map so that gi --> taxid, taxid = -1 if no live gi blockSize = 2^20; % block size (1MB) mapTaxoFile(taxoFilenameIn, taxoFilenameOut, blockSize); %=== map file into memory mt = memmapfile(taxoFilenameOut, 'format', 'int32');
We can access the taxid of first ten live GenBank sequences as follows:
q = find(mt.Data(1:100)>0);
mt.Data(q(1:10))
clear q
ans =
9913
9913
9913
9913
9913
9913
9913
9913
9913
9913
Annotating the BLAST Report with Taxonomic Information
We are now interested in performing a taxonomic annotation of each hit in the BLAST report. We extract the gi number of each hit and retrieve its associated taxid.
%=== extract gi number for each hit gi = zeros(1, numEntries); for i = 1:numEntries g = str2double(regexpi(hits{i}, '(?<=gi\|)\d+', 'match', 'once') ); if ~isempty(g) gi(i) = g; end end %=== determine taxid for each hit taxid = mt.Data(gi);
If you performed the BLAST search against a database that is outdated with respect to the taxonomy information included in the nodes.dmp file, some gi numbers might be superseded. Therefore, you need to exclude from the analysis those sequences associated with superseded entries.
%=== ignore dead gi numbers
livegi = (taxid ~= -1);
gi = gi(livegi); taxid = taxid(livegi);
queries = queries(livegi); hits = hits(livegi);
evalue = evalue(livegi); score = score(livegi);
During the search against the NCBI-NR Database, the first query (SHAA001TR) hit n sequences with significant expectation value and score. We can look at the taxonomic assignment of these hits using the array taxid.
n = length(strmatch('SHAA001TR', queries))
hits(1:n)
taxid(1:n)
n =
12
ans =
'gi|118591585|ref|ZP_01548982.1|'
'gi|83951381|ref|ZP_00960113.1|'
'gi|86137830|ref|ZP_01056406.1|'
'gi|149203209|ref|ZP_01880179.1|'
'gi|114769111|ref|ZP_01446737.1|'
'gi|56709160|ref|YP_165205.1|'
'gi|85704868|ref|ZP_01035969.1|'
'gi|110681001|ref|YP_684008.1|'
'gi|121611410|ref|YP_999217.1|'
'gi|99080687|ref|YP_612841.1|'
'gi|84514612|ref|ZP_01001976.1|'
'gi|87119306|ref|ZP_01075204.1|'
ans =
384765
89187
314262
391613
367336
246200
314264
375451
391735
292414
314232
314277
Classifying BLAST Hits by Scientific Name
Every taxid corresponds to a specific taxon, which has been given a scientific name and possibly various synonyms. For our classification purposes, we are interested in the scientific names only. Thus, we extract this information and annotate each BLAST hit in the report using the scientific names, rather than the taxids.
%=== read taxonomy name file taxonomyFilenameIn = 'names.dmp'; fid1 = fopen(taxonomyFilenameIn,'rt'); nameInfo = textscan(fid1, '%d%s%s%s', 'delimiter', '|'); fclose(fid1); %=== preallocate space for SN maxTaxid = max(double(nameInfo{1})); SN = repmat({''}, maxTaxid, 1); %=== populate array so that taxid --> scientific name ind = strmatch('scientific', nameInfo{4}); % indices of scientific name s in the array SN(nameInfo{1}(ind)) = strtrim(nameInfo{2}(ind)); %=== assign name to every hit sciNames = SN(taxid);
We can look at the scientific names of the organisms whose sequences were hit by the first query by considering the first n elements in the array sciNames, as follows:
sciNames(1:n)
ans =
'Labrenzia aggregata IAM 12614'
'Roseovarius nubinhibens ISM'
'Roseobacter sp. MED193'
'Roseovarius sp. TM1035'
'Rhodobacterales bacterium HTCC2255'
'Ruegeria pomeroyi DSS-3'
'Roseovarius sp. 217'
'Roseobacter denitrificans OCh 114'
'Verminephrobacter eiseniae EF01-2'
'Ruegeria sp. TM1040'
'Loktanella vestfoldensis SKA53'
'Marinomonas sp. MED121'
Saving Annotated BLAST Report
Once we determine the taxonomic classification for each hit, we can include the information in a text file as shown below:
%=== create annotated report for first n hits textFilename = 'sargasso-annotated-report.txt'; fid = fopen(textFilename, 'wt'); for i = 1:n fprintf(fid, '%s\t%s\t%d\t%d\t%s\n', queries{i}, hits{i}, eval ue(i), taxid(i), sciNames{i}); end fclose(fid); type sargasso-annotated-report.txt
SHAA001TR gi|118591585|ref|ZP_01548982.1| 2.000000e-090 384765 L abrenzia aggregata IAM 12614 SHAA001TR gi|83951381|ref|ZP_00960113.1| 5.000000e-089 89187 Ros eovarius nubinhibens ISM SHAA001TR gi|86137830|ref|ZP_01056406.1| 4.000000e-087 314262 Ro seobacter sp. MED193 SHAA001TR gi|149203209|ref|ZP_01880179.1| 5.000000e-087 391613 R oseovarius sp. TM1035 SHAA001TR gi|114769111|ref|ZP_01446737.1| 8.000000e-087 367336 R hodobacterales bacterium HTCC2255 SHAA001TR gi|56709160|ref|YP_165205.1| 1.000000e-086 246200 Rueg eria pomeroyi DSS-3 SHAA001TR gi|85704868|ref|ZP_01035969.1| 4.000000e-086 314264 Ro seovarius sp. 217 SHAA001TR gi|110681001|ref|YP_684008.1| 3.000000e-084 375451 Ros eobacter denitrificans OCh 114 SHAA001TR gi|121611410|ref|YP_999217.1| 4.000000e-083 391735 Ver minephrobacter eiseniae EF01-2 SHAA001TR gi|99080687|ref|YP_612841.1| 4.000000e-083 292414 Rueg eria sp. TM1040 SHAA001TR gi|84514612|ref|ZP_01001976.1| 2.000000e-080 314232 Lo ktanella vestfoldensis SKA53 SHAA001TR gi|87119306|ref|ZP_01075204.1| 2.000000e-079 314277 Ma rinomonas sp. MED121
Determining the Taxonomic Distribution of BLAST Hits
One reason to classify the sequence hits in a BLAST report is to study their taxonomic distribution. We can easily create a list of organisms that are represented in the report, their taxids and their frequency as follows:
%=== distribution by taxid taxidList = unique(taxid); % list of unique taxids T = accumarray(taxid, 1); % multiplicity of taxids taxidCount = T(unique(taxid)); % number of hits for each taxon %=== simple statistics of the hit distribution numTaxa = length(taxidList) % number of distinct taxa [maxCount,maxInd] = max(taxidCount); % most represented taxon maxTaxid = taxidList(maxInd) % taxid of the most represented taxon maxSN = SN(maxTaxid) % name of the most represented taxon maxCount
numTaxa =
837
maxTaxid =
269483
maxSN =
'Burkholderia sp. 383'
maxCount =
45
From the simple statistics on the taxonomic distribution, we observe that the most represented taxon is the Burkholderia sp.383 (taxid 269483). The over-representation of this bacterium, which is usually found in terrestrial settings, in the Sample 1 of the Sargasso Sea data set is discussed in [1].
Filtering Out Isolated Assignments
Several taxa in the report appear to be isolated assignments because they are hit by only one sequence. These taxa are rarely true members of the environmental community under investigation. Thus it is useful to identify them and discard them if needed.
t1 = taxidCount == 1; isolated = length(find(t1)) taxidList = taxidList(~t1); taxidCount = taxidCount(~t1); numTaxaFiltered = length(taxidCount)
isolated = 298 numTaxaFiltered = 539
Plotting Taxonomic Distribution of BLAST Hits
If we plot the taxonomic distribution of the hits on a bar chart, we observe that the majority of taxa has a low number of occurrences.
%=== plot by sorting the counts hFig = figure(); bar(sort(taxidCount)); xlabel('Distinct taxonomic assignments'); ylabel('Number of hits'); title('Taxonomic distribution of filtered hits'); set(get(hFig, 'children'), 'XTickLabel', '')
Limiting the Analysis to the Best Hit for Each Query
We can repeat the above procedure by limiting the analysis to only the best-scoring hit for each query sequence. Even though analyses limited to the best-scoring hits cannot depict a complete and accurate picture of the situation, they can be useful as a first approximation and overcome the difficulty inherent with large data sets.
%== get only best hits [queriesUnique, idx] = unique(queries, 'first'); % best hits rows bestHitTaxid = taxid(idx); bestHitSciName = sciNames(idx); %=== count occurrences T = accumarray(bestHitTaxid, 1); % multiplicity of taxids bestCount = T(unique(bestHitTaxid)); % number of hits for each taxon bestCountNames = SN(unique(bestHitTaxid)); %=== five most represented taxa [bestCountSorted, idx] = sort(bestCount, 'descend'); bestCountSorted(1:5) bestCountNames(idx(1:5))
ans =
16
8
8
4
3
ans =
'Burkholderia sp. 383'
'Candidatus Pelagibacter ubique HTCC1002'
'Candidatus Pelagibacter ubique HTCC1062'
'Shewanella sp. ANA-3'
'Shewanella sp. MR-7'
In our example, when only the best-scoring hits are considered, Burkholderia, Candidatus pelagibacter ubique and Shewanella appear to be the most represented taxa in the report. While finding Candidatus pelagibacter ubique is not surprising, because it is a dominant form of life in the Sargasso Sea, Burkholderia and Shewanella are not expected to be present in this marine sample where nutrients and resources are low, because they live either in terrestrial settings or in aquatic, nutrient-rich environments respectively. For a detailed discussion regarding the presence of these bacteria in the Sargasso Sea, see [1].
Memory-Mapping the Taxon Node Information
Oftentimes, to gain a clear vision of the taxonomic distribution of a sequence set, Linnaean categories higher than species are considered. To perform this analysis, we need to create a map between each taxid and its assigned rank, as well as a map between each taxid and the taxid of its parent node, according to the NCBI Taxonomy Database schema.
%type mapNodeFile nodeFilename = 'nodes.dmp'; parentFilename = 'nodes_parent_map.dmp'; rankFilename = 'nodes_rank_map.dmp'; %=== create a map mapNodeFile(nodeFilename, parentFilename, rankFilename, blockSize); %=== map the files into memory mmParentObj = memmapfile(parentFilename, 'format', 'int32'); % taxid --> taxid_parent mmRankObj = memmapfile(rankFilename, 'format', 'int32'); % taxid --> rank
Classifying BLAST Hits by Higher Taxonomic Rank
After the maps are created, for every hit that is associated with a taxid corresponding to a Linnaean category more specific than the target rank, we determine the parental taxid and its rank until the target is reached. Then we annotate the hit with the taxid of its more distant ancestor. Synthetic constructs or nodes with no rank are considered descendants of the root. This procedure of walking up the taxonomic hierarchy is performed by the function findTaxoRank.
%type findTaxoRank
Suppose we are interested in classifying our hits according to the superkingdom to which they belong. After assigning the superkigdom taxid to each hit, we group and count the occurrences as follows:
%=== find superkingdom assignments skRank = findTaxoRank(taxidList, mmRankObj, mmParentObj, 1); sk = accumarray(skRank, 1); skCount = sk(unique(skRank)); skNames = SN(unique(skRank)); %=== plot pie chart hFig = figure(); pie(skCount); colormap(summer) legend(skNames, 'location', 'EastOutside');
As expected, the majority of the hits are bacteria. Similarly, we can determine the taxonomic distribution at the level of phylum, class, order and family as shown below:
rTargetString = {'phylum', 'class', 'order', 'family'}
rTarget = [5 8 11 14];
numTarget = numel(rTarget);
rank = cell(1,numTarget);
%=== annotate hits with the taxid at the target level
for i = 1:numTarget
rank{i} = findTaxoRank(taxidList, mmRankObj, mmParentObj, rTarget(i
));
end
%=== determine the distribution
count = cell(1,numTarget);
names = cell(1,numTarget);
for i = 1:numTarget
list = unique(rank{i});
T = accumarray(rank{i}, 1);
count{i} = T(list);
names{i} = SN(list);
end
%=== plot the first two classifications
for i = 1:2
figure(); barh(count{i});
set(gca, 'yTick', 1:numel(names{i}));
set(gca, 'yTicklabel', names{i});
xlabel('Occurrences');
title(['Taxonomic distribution at the ' rTargetString{i} ' level'])
end
%=== Draw a Pareto chart for the phyla
pnames = names{1}; pcount = count{1};
[ppeaks, pind] = sort(pcount, 'descend');
plabels = pnames(pind);
figure(); pareto(pcount, pnames);
ylabel('Occurrences');
pticks = get(gca, 'XTick');
ht = text(pticks', ppeaks(pticks)+10, plabels(pticks), 'rotation', 45);
title('Pareto chart for distribution at the phylum level');
set(gca, 'XtickLabel', '')
rTargetString =
'phylum' 'class' 'order' 'family'
Representing the Taxonomic Distribution on a Graph
The taxonomic distributions at different levels are related to each other by hierarchy. Suppose we want to look at the distribution of hits across phyla and visualize them on a graph. After filtering out the counts of the low represented phyla (<5 counts), we create a connectivity matrix where all phyla are direct children of the root.
k = count{1} > 5;
phylaNames = names{1}(k);
n1 = length(phylaNames);
CM = zeros(n1); CM(1,2:end) = 1;
bg = biograph(CM, phylaNames);
view(bg)
We can now consider all the hits classified as Proteobacteria (taxid 1224), and perform the same distribution analysis at the level of classes.
%=== consider only Proteobacteria pb = taxidList(rank{1} == 1224); pbRank = findTaxoRank(pb, mmRankObj, mmParentObj, 8); pbList = unique(pbRank); pbT = accumarray(pbRank, 1); pbCount = pbT(pbList); pbNames = SN(pbList); %=== filter out if less than 5 counts h = pbCount > 5; pbCount = pbCount(h); n2 = length(pbCount); pbNames = pbNames(h)
pbNames =
'Gammaproteobacteria'
'Alphaproteobacteria'
'Betaproteobacteria'
'Deltaproteobacteria'
'Epsilonproteobacteria'
To represent the various phyla and the Proteobacteria class in the same graph, we need to create a connectivity matrix such that all the phyla are children of the root and all the Proteobacteria classes are children of the Proteobacteria node. In the graph, the labels include the class names and the number of occurrences in the BLAST report.
%=== find Proteobacteria node x = strmatch('Proteobacteria', phylaNames); %=== combine names and counts numNodes = n1 + n2; allNames(1:n1) = phylaNames; allNames(n1+1:numNodes) = pbNames; allCount(1:n1) = count{1}(k); allCount(n1+1:numNodes) = pbCount; %=== create labels for nodes (scientific name and count) labels = cell(1,numNodes); for node = 1:numNodes labels{node} = [allNames{node} ' (' num2str(allCount(node)) ') ']; end %=== create graph CM = zeros(numNodes); CM(1,2:n1) = 1; CM(x, n1+1:numNodes) = 1; CM(x,x) = 0; bg = biograph(CM, labels, 'showArrows', 'off'); bg.view %=== clear memory mapped variables clear mmParentObj mmRankObj mt
References
[1] Venter, J.C., et al. (2004). Environmental genome shotgun sequencing of the Sargasso sea. Science, 304, 66-74.
Store