Bioinformatics Toolbox 3.4
Working with Whole Genome Data
Whole genomes are available for human, mouse, rat, fugu, and several other model organisms. For many of these organisms one chromosome can be several hundred million base pairs long. Working with such large data sets can be challenging as you may run into limitations of the hardware and software that you are using. This example shows one way to work around these limitations in MATLAB®.
Large Data Set Handling Issues
Solving technical computing problems that require processing and analyzing large amounts of data puts a high demand on your computer system. Large data sets take up significant memory during processing and can require many operations to compute a solution. It can also take a long time to access information from large data files.
Computer systems, however, have limited memory and finite CPU speed. Available resources vary by processor and operating system, the latter of which also consumes resources. For example:
32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 bytes (4 GB) of memory, also known as the virtual address space. Windows® XP and Windows® 2000 allocate only 2 GB of this virtual memory to each process (such as MATLAB). On UNIX®, the virtual memory allocated to a process is system-configurable and is typically around 3 GB. The application carrying out the calculation, such as MATLAB, can require storage in addition to the user task. The main problem when handling large amounts of data is that the memory requirements of the program can exceed what is available on the platform. For example, MATLAB generates an "out of memory" error when data requirements exceed approximately 1.7 GB on Windows XP.
More details on memory management and large data sets can be found in a MATLAB Digest article or in the Memory Management Guide on our support site.
On a typical 32-bit machine, the maximum size of a single data set that you can work with in MATLAB is a few hundred MB, or about the size of a large chromosome. Memory mapping of files allows MATLAB to work around this limitation and enables you to work with very large data sets in an intuitive way.
Whole Genome Data Sets
The latest whole genome data sets can be downloaded from the Ensembl Website. The data are provided in several formats and are updated regularly as new sequence information becomes available. This example uses human DNA data stored in FASTA format. Chromosome 1 is (in the NCBI36.50 Release of July 2008) a 62.4 MB compressed file. After uncompressing, the file is about 250 MB. MATLAB uses 2 bytes per character, so if you read the file into MATLAB, it will require about 500 MB of memory.
This demonstration assumes that you have already downloaded and uncompressed the FASTA file into your local directory. Change the name of the variable FASTAfilename if appropriate.
FASTAfilename = 'Homo_sapiens.NCBI36.50.dna.chromosome.1.fa';
fileInfo = dir(FASTAfilename)
fileInfo =
name: 'Homo_sapiens.NCBI36.50.dna.chromosome.1.fa'
date: '04-Jul-2008 11:35:56'
bytes: 251370600
isdir: 0
datenum: 7.3359e+005
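The memory estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it takes the bytes value reported by dir, ignores the header and new line characters, and uses whos on a short test string to confirm the 2 bytes per character figure. The variable names are illustrative and not part of the original example.

```matlab
% Size of the uncompressed FASTA file, copied from the dir output above.
fileBytes = 251370600;

% Confirm how many bytes MATLAB uses per character with a short test string.
x = 'ACGT';
info = whos('x');
bytesPerChar = info.bytes/numel(x);      % 2 bytes per character

% Approximate memory needed as char data and as uint8 data, in MB.
charMB  = fileBytes*bytesPerChar/2^20;   % roughly 480 MB
uint8MB = fileBytes/2^20;                % roughly 240 MB

fprintf('As char:  %.0f MB\nAs uint8: %.0f MB\n', charMB, uint8MB);
```

Storing each nucleotide as a uint8 value halves the footprint, which is one reason the rest of this example converts the sequence to integers before mapping it.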
Memory Mapped Files
Memory mapping allows MATLAB to access data in a file as though it is in memory. You can use standard MATLAB indexing operations to access data. See the documentation for memmapfile for more details.
You could just map the FASTA file and access the data directly from there. However, the FASTA format file includes new line characters, and the memmapfile function treats these characters in the same way as all other characters. Removing them before memory mapping the file makes indexing operations simpler. Also, memory mapping does not work directly with character data, so you have to treat the data as 8-bit integers (uint8 class). The function nt2int in the Bioinformatics Toolbox™ converts character information into integer values, and int2nt converts the integers back to characters.
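As a quick illustration of the conversion described above, the following sketch runs a short, made-up sequence through nt2int and int2nt. It assumes the Bioinformatics Toolbox is installed; the sequence itself is hypothetical.

```matlab
% A short test sequence (hypothetical), including an unknown base N.
seq = 'ACGTN';

% Convert characters to integer codes: A=1, C=2, G=3, T=4, N=15.
codes = nt2int(seq);

% The codes are stored as uint8, one byte per nucleotide.
codeClass = class(codes);

% Convert the integer codes back to characters.
back = int2nt(codes);
```

The round trip returns the original string, so nothing is lost by storing the sequence as integers.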
First open the FASTA file and extract the header.
fidIn = fopen(FASTAfilename,'r');
header = fgetl(fidIn)
header =
>1 dna:chromosome chromosome:NCBI36:1:1:247249719:1
Open the file to be memory mapped.
[fullPath, filename, extension] = fileparts(FASTAfilename);
mmFilename = [filename '.mm']
fidOut = fopen(mmFilename,'w');
mmFilename =
Homo_sapiens.NCBI36.50.dna.chromosome.1.mm
Read the FASTA file in blocks of 1 MB, remove new line characters, convert to uint8, and write to the MM file.
newLine = sprintf('\n');
blockSize = 2^20;
while ~feof(fidIn)
    % Read in the data
    charData = fread(fidIn,blockSize,'*char')';
    % Remove new lines
    charData = strrep(charData,newLine,'');
    % Convert to integers
    intData = nt2int(charData);
    % Write to the new file
    fwrite(fidOut,intData,'uint8');
end
Close the files.
fclose(fidIn);
fclose(fidOut);
The new file is about the same size as the old file but does not contain new lines or the header information.
mmfileInfo = dir(mmFilename)
mmfileInfo =
name: 'Homo_sapiens.NCBI36.50.dna.chromosome.1.mm'
date: '29-Jul-2008 14:15:57'
bytes: 247249719
isdir: 0
datenum: 7.3362e+005
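One way to sanity-check the conversion is to compare the size of the new file with the sequence length stated in the FASTA header. The sketch below hard-codes the header and byte count shown earlier, and assumes that the colon-separated header fields end with start, end, and strand; that reading of the header is an assumption, not something this example states.

```matlab
% Header line and file size as printed earlier in this example.
header  = '>1 dna:chromosome chromosome:NCBI36:1:1:247249719:1';
mmBytes = 247249719;

% Assumption: the last three colon-separated fields are start:end:strand.
fields   = regexp(header, ':', 'split');
seqStart = str2double(fields{end-2});
seqEnd   = str2double(fields{end-1});

% With one byte per nucleotide, the file size should equal the sequence length.
expectedLength = seqEnd - seqStart + 1;
lengthsMatch   = (expectedLength == mmBytes);
```

If the two numbers disagree, the usual suspects are new line characters that were not removed, for example carriage returns in a file with Windows line endings.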
Accessing the Data in the Memory Mapped File
The command memmapfile constructs a memmapfile object that maps the new file to memory. In order to do this, it needs to know the format of the file. The format of this file is simple, though much more complicated formats can be mapped.
chr1 = memmapfile(mmFilename, 'format', 'uint8')
chr1 =
Filename: 'C:\Work\Biomemorymap\Homo_sapiens.NCBI36.50.dna.chromosome.1.mm'
Writable: false
Offset: 0
Format: 'uint8'
Repeat: Inf
Data: 247249719x1 uint8 array
The MEMMAPFILE Object
The memmapfile object has various properties. Filename stores the full path to the file. Writable indicates whether or not the data can be modified. Note that if you do modify the data, this also modifies the original file. Offset allows you to specify the space used by any header information. Format indicates the data format. Repeat specifies how many blocks (as defined by Format) to map, which can be useful for limiting how much memory is used to create the memory map. These properties can be accessed in the same way as other MATLAB data. For more details, type help memmapfile or doc memmapfile.
chr1.Data(1:10)
ans =
4
1
1
2
2
2
4
1
1
2
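The Offset and Repeat properties described above are easier to see on a small scratch file than on a whole chromosome. The following sketch writes 20 bytes to a temporary file (a hypothetical file created only for this illustration), skips the first 4 bytes, and maps the next 10.

```matlab
% Create a small scratch file containing the bytes 1..20.
tmpFile = [tempname '.bin'];
fid = fopen(tmpFile, 'w');
fwrite(fid, uint8(1:20), 'uint8');
fclose(fid);

% Skip a 4-byte "header" and map only the next 10 uint8 values.
m = memmapfile(tmpFile, 'Format', 'uint8', 'Offset', 4, 'Repeat', 10);

firstValues = m.Data(1:3)';   % the first mapped bytes are 5, 6, 7
numMapped   = numel(m.Data);  % 10 elements are mapped, not all 20

% Clear the map before deleting the file it refers to.
clear m
delete(tmpFile)
```

Offset is given in bytes, so the same idea could be used to skip a fixed-size header in a data file without rewriting it.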
You can access any region of the data using indexing operations.
chr1.Data(10000000:10000010)'
ans =
4 3 3 2 2 1 1 4 4 4 4
Remember that the nucleotide information was converted to integers. You can use int2nt to get the sequence information back.
int2nt(chr1.Data(10000000:10000010)')
ans =
TGGCCAATTTT
Or use seqdisp to display the sequence.
seqdisp(chr1.Data(10000000:10001000)')
ans =
  1  TGGCCAATTT TTTTGTATTT TTAGTAGAGA TAGGGTTTCA CCATATTAGC CAGGATGGTC
 61  TTGATCTGCT GACCTCATGA CCCACCCGCC TCGGCCTTCC AAAGTGCTGG GATTACAGGT
121  GTGAGCCACC GCGACCGGCC TGCTCAAGAT AATTTTTAGG GCTAACTATG ACATGAACCC
181  CAAAATTCCT GTCCTCTAGA TGGCAGAAAC CAAGATAAAG TATCCCCACA TGGCCACAAG
241  GTTAAGCTCT TATGGACACA AAACAAGGCA GAGAAATGTC ATTTGGCATT GGTTTCAGGG
301  ACCCATAGCA ACATTTGTAA ATGACCAGCC TGATGGGCTG GCTTGAAAAC TTGGCTTATA
361  GGCATCCTAA ACCCACGTTC TATCCCCTGA TACTCCCCTC TTCATTACAG AACAACAAAG
421  AAAGACAAAT TCTTAGCATA AAGTACACCA GATTTGCTAC AGCCTAAGAC TGGTCTGACA
481  AATCCTTTTT TTCTACTAAT CAGACCCTCG CAGAGAAGAC AAATAGTGGC ATTTACCGTT
541  TACACAACAT ATACAGAGAG AGAGAGACCA GAAACTTGGC TGGTAAGAAT TTCTTCCTCT
601  GGCCAGGAGC GGTGGCTCAC ACCTGTAATC TCAGCCCTTT GGGAGGCTGA GGCGGGTGGA
661  TCAGAAGGTC AAGAGATCCA GACCATCCTG GCCAACATGG TGAAACCCCG TCTCTACTAA
721  AAATACAAAA ATTAGCTGGA CATGGTGGTG GGCGCCTGTA GTCCCAGCTA CTCAGGAGGC
781  TGAGGCAGGA GAATTGCTTG AACCCAGGAG GTGGAGGTTG CAGTGAGCCT AGATCACGCC
841  ACTGCACTCC AGCCTGGCGA CACAGCGAGA CTCCGTCTCA AAAAAAAAAA TAATAAATAA
901  GAAAAGGAAA AAAAAGAATA CAACTCAGGA ACAGCCAAAT GGAGGAGATG CATGGGACAA
961  GGTTTAGTGG GGGGCTGCGG AGCTTCCTTG CCCTCTGCAG G
Analysis of the Whole Chromosome
Now that you can easily access the whole chromosome, you can analyze the data. This example shows one way to look at the GC content along the chromosome.
Extract blocks of 500,000 bp and calculate the GC content of each block.
Calculate how many blocks to use.
numNT = numel(chr1.Data);
blockSize = 500000;
numBlocks = floor(numNT/blockSize);
One way to help MATLAB performance when working with large data sets is to "preallocate" space for data. This allows MATLAB to allocate enough space for all of the data at once rather than having to grow the array in small chunks. This speeds things up and also guards against running out of memory partway through a computation. For more details on preallocating arrays, see: http://www.mathworks.com/support/solutions/data/1-18150.html?solution=1-18150
An easy way to preallocate an array is to use the zeros function.
ratio = zeros(numBlocks+1,1);
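To make the difference concrete, the sketch below fills the same vector twice, once by growing it inside the loop and once after preallocating it with zeros. The loop body and the size n are arbitrary choices for illustration. Both versions produce identical results; only the way memory is allocated differs, and timings are omitted because they vary by machine and MATLAB release.

```matlab
n = 1000;

% Growing the array: MATLAB may have to reallocate as the vector lengthens.
grown = [];
for k = 1:n
    grown(end+1,1) = k^2; %#ok<SAGROW>
end

% Preallocating: the full vector is reserved once, then filled in place.
prealloc = zeros(n,1);
for k = 1:n
    prealloc(k) = k^2;
end

sameResult = isequal(grown, prealloc);
```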
Loop over the data looking for C or G and then divide this number by the total number of A, T, C, and G. This will take about a minute to run.
A = nt2int('A');
C = nt2int('C');
G = nt2int('G');
T = nt2int('T');
for count = 1:numBlocks
    % calculate the indices for the block
    start = 1 + blockSize*(count-1);
    stop = blockSize*count;
    % extract the block
    block = chr1.Data(start:stop);
    % find the GC and AT content
    gc = (sum(block == G | block == C));
    at = (sum(block == A | block == T));
    % calculate the ratio of GC to the total known nucleotides
    ratio(count) = gc/(gc+at);
end
The final block is smaller so treat this as a special case.
block = chr1.Data(stop+1:end);
gc = (sum(block == G | block == C));
at = (sum(block == A | block == T));
ratio(end) = gc/(gc+at);
Plot of the GC Content for the Homo Sapiens Chromosome 1
xAxis = [1:blockSize:numBlocks*blockSize, numNT];
plot(xAxis,ratio)
xlabel('Base pairs');
ylabel('Relative GC content');
title('Relative GC content of Homo Sapiens Chromosome 1')
The region in the center of the plot, around 140 Mbp, is a large run of Ns (undetermined bases).
seqdisp(chr1.Data(140000000:140001000))
ans = 1 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 61 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 121 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 181 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 241 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 301 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 361 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 421 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 481 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 541 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 601 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 661 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 721 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 781 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 841 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 901 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 961 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN N
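A block that contains only Ns has no A, C, G, or T at all, so the GC calculation used in this example divides zero by zero. The toy sketch below shows the effect on two tiny, made-up blocks. To stay independent of the toolbox it hard-codes the integer codes that nt2int produces (A=1, C=2, G=3, T=4, N=15); the block contents are hypothetical.

```matlab
% Integer codes as produced by nt2int, written out by hand here.
A = uint8(1); C = uint8(2); G = uint8(3); T = uint8(4); N = uint8(15);

% A mixed block: 4 G/C, 2 A/T, and 2 N that are ignored by the ratio.
block = [G C A T N N G C]';
gc = sum(block == G | block == C);
at = sum(block == A | block == T);
ratioMixed = gc/(gc+at);          % 4/6

% A block made up entirely of N: both counts are zero.
blockN = repmat(N, 10, 1);
gcN = sum(blockN == G | blockN == C);
atN = sum(blockN == A | blockN == T);
ratioN = gcN/(gcN+atN);           % 0/0 evaluates to NaN
```

MATLAB's plot function does not draw NaN values, so blocks like this leave a break in the curve rather than a misleading zero.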
Finding Regions of High GC Content
You can use find to identify regions of high GC content.
indices = find(ratio>0.5);
ranges = [(1 + blockSize*(indices-1)), blockSize*indices];
fprintf('Region %d:%d has GC content %f\n',[ranges ,ratio(indices)]')
Region 500001:1000000 has GC content 0.504629
Region 1000001:1500000 has GC content 0.595154
Region 1500001:2000000 has GC content 0.541260
Region 2000001:2500000 has GC content 0.593458
Region 2500001:3000000 has GC content 0.569676
Region 3000001:3500000 has GC content 0.585224
Region 3500001:4000000 has GC content 0.534905
Region 6000001:6500000 has GC content 0.556748
Region 9000001:9500000 has GC content 0.507248
Region 10500001:11000000 has GC content 0.516690
Region 11500001:12000000 has GC content 0.520740
Region 16000001:16500000 has GC content 0.504110
Region 17000001:17500000 has GC content 0.503346
Region 17500001:18000000 has GC content 0.516074
Region 22000001:22500000 has GC content 0.511540
Region 41500001:42000000 has GC content 0.501028
Region 53500001:54000000 has GC content 0.500928
Region 153000001:153500000 has GC content 0.517338
Region 154500001:155000000 has GC content 0.505378
Region 226000001:226500000 has GC content 0.508872
Region 226500001:227000000 has GC content 0.502120
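The index arithmetic behind these ranges can be checked on a small made-up example. The sketch below uses a hypothetical five-block ratio vector with a short final block. The min call that clamps the last range to the sequence length is an addition for this illustration, not part of the original code; it only matters if the final, shorter block exceeds the threshold.

```matlab
% Hypothetical GC ratios for five blocks; the last block is shorter.
ratio     = [0.40; 0.55; 0.60; 0.45; 0.52];
blockSize = 100;
numNT     = 450;

% Blocks whose GC content exceeds 50%.
indices = find(ratio > 0.5);

% Start and end positions of each block, clamped to the sequence length.
ranges = [1 + blockSize*(indices-1), min(blockSize*indices, numNT)];

fprintf('Region %d:%d has GC content %f\n', [ranges, ratio(indices)]')
```

The transpose in the fprintf call matters: fprintf consumes its arguments column by column, so each column of the transposed matrix supplies one start, one end, and one ratio.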
If you want to remove the temporary file, you must first clear the memmapfile object.
clear chr1
delete(mmFilename)