Using Memory Mapped Files to Work With Whole Genome Data
Whole genomes are available for humans, mouse, rat, fugu, and several other model organisms. For many of these organisms one chromosome can be several hundred million base pairs long. Working with such large data sets can be challenging as you may run into limitations of the hardware and software that you are using. This example shows one way to work around these limitations in MATLAB.
Contents
Large Data Set Handling Issues
Solving technical computing problems that require processing and analyzing large amounts of data puts a high demand on your computer system. Large data sets take up significant memory during processing and can require many operations to compute a solution. It can also take a long time to access information from large data files.
Computer systems, however, have limited memory and finite CPU speed. Available resources vary by processor and operating system, the latter of which also consumes resources. For example:
32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 = 4 GB of memory (also known as virtual address space). Windows XP and Windows 2000 allocate only 2 GB of this virtual memory to each process (such as MATLAB). On UNIX, the virtual memory allocated to a process is system-configurable and is typically around 3 GB. The application carrying out the calculation, such as MATLAB, can require storage in addition to the user task. The main problem when handling large amounts of data is that the memory requirements of the program can exceed that available on the platform. For example, MATLAB generates an "out of memory" error when data requirements exceed approximately 1.7 GB on Windows XP.
More details on memory management and large data sets can be found in this MATLAB Digest article or in the Memory Management Guide in our support site.
On a typical 32-bit machine, the maximum size of a single data set that you can work with in MATLAB is a few hundred MB, or about the size of a large chromosome. Memory mapping of files allows MATLAB to work around this limitation and enables you to work with very large data sets in an intuitive way.
Whole Genome Data Sets
The latest whole genome data sets can be downloaded from the Ensembl Web site. The data are provided in several formats. These are updated regularly as new sequence information becomes available. This example will use human DNA data stored in FASTA format. Chromosome 1 is (in the NCBI35 December 2005 release) a 64.5 MB compressed file. After uncompressing the file it is about 250MB. MATLAB uses 2 bytes per character, so if you read the file into MATLAB, it will require about 500MB of memory.
This example assumes that you have already downloaded and uncompressed the FASTA file into your local directory, change the name of the variable FASTAfilename if appropriate.
FASTAfilename = 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa';
fileInfo = dir(FASTAfilename)
fileInfo =
name: 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa'
date: '15-Nov-2005 09:48:56'
bytes: 249614947
isdir: 0
Memory Mapped Files
Memory mapping allows MATLAB to access data in a file as though it is in memory. You can use standard MATLAB indexing operations to access data. See the documentation for memmapfile for more details.
You could just map the FASTA file and access the data directly from there. However the FASTA format file includes new line characters. The memmapfile function treats these characters in the same way as all other characters. Removing these before memory mapping the file will make indexing operations simpler. Also, memory mapping does not work directly with character data so you will have to treat the data as 8-bit integers (uint8 class). The function nt2int in the Bioinformatics Toolbox can be used to convert character information into integer values. int2nt is used to convert back to characters.
First open the FASTA file and extract the header.
fidIn = fopen(FASTAfilename,'r');
header = fgetl(fidIn)
header = >1 dna:chromosome chromosome:NCBI35:1:1:245522847:1
Open the file to be memory mapped
[fullPath, filename, extension] = fileparts(FASTAfilename); mmFilename = [filename '.mm'] fidOut = fopen(mmFilename,'w');
mmFilename = Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm
Read the FASTA file in blocks of 1MB, remove new line characters, convert to uint8, and write to the MM file.
newLine = sprintf('\n'); blockSize = 2^20; while ~feof(fidIn) % Read in the data charData = fread(fidIn,blockSize,'*char')'; % Remove new lines charData = strrep(charData,newLine,''); % Convert to integers intData = nt2int(charData); % Write to the new file fwrite(fidOut,intData,'uint8'); end
Close the files.
fclose(fidIn); fclose(fidOut);
The new file is about the same size as the old file but does not contain new lines or the header information.
mmfileInfo = dir(mmFilename)
mmfileInfo =
name: 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm'
date: '19-Jan-2006 11:57:59'
bytes: 245522847
isdir: 0
Accessing the Data in the Memory Mapped File
The command memmapfile constructs a memmapfile object that maps the new file to memory. In order to do this, it needs to know the format of the file. The format of this file is simple, though much more complicated formats can be mapped.
chr1 = memmapfile(mmFilename, 'format', 'uint8')
chr1 =
Filename: 'C:\matlab\toolbox\bioinfo\biodemos\Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm'
Writable: false
Offset: 0
Format: 'uint8'
Repeat: Inf
Data: 245522847x1 uint8 array
The MEMMAPFILE Object
The memmapfile object has various properties. Filename stores the full path to the file. Writable indicates whether or not the data can be modified. Note that if you do modify the data, this will also modify the original file. Offset allows you to specify the space used by any header information. Format indicates the data format. Repeat is used to specify how many blocks (as defined by Format) to map. This can be useful for limiting how much memory is used to create the memory map. These properties can be accessed in the same way as other MATLAB data. For more details see type help memmapfile or doc memmapfile.
chr1.Data(1:10)
ans =
4
1
1
2
2
2
4
1
1
2
You can access any region of the data using indexing operations.
chr1.Data(10000000:10000010)'
ans =
3 2 2 1 2 1 3 4 2 4 2
Remember that the nucleotide information was converted to integers. You can use int2nt to get the sequence information back.
int2nt(chr1.Data(10000000:10000010)')
ans = GCCACAGTCTC
Or use seqdisp to display the sequence.
seqdisp(chr1.Data(10000000:10001000)')
ans = 1 GCCACAGTCT CCCGAGTAGC TGGGATTACA GGGGTGCACC ACCATGCCCA GCTAATTTTT 61 GTTTTTCAGT AGAGATGGGG TTTTGCCATG TTGCCCAGGC TGGTTTTGAA CTCCTGACCT 121 CAGGTGATCC ACCTGCCTCT GCCTCCCAAA GTGCTGGGAT TACAGGCATG AGCCACCACG 181 GCTGGCAGAA TCACGTCTCA TCTCTAACTC CTCTCCTTCC TCCTCCCTCT CCCCGATTCT 241 GCGGCAGATA CACTAGGCTC CTCAGAGTTC CCGAAACACA CTGGCACACC CTTCCTCAGA 301 GTCCCAGGGC TCCCTGGATT TCTGCCTTAA ACTATTTTCC CAGAAATCTG TGTGGCTTGA 361 TTCTTTACTT CTTTCAGCTC CCCGCTGAGA CGTCACCTGG CCATTATTTA AAATAGTGGT 421 GTTTATTTCT AGCTACATAA TATGCTCAAA GGCAGTAAGT GGAACTGGGA TTCCAAACCC 481 TGATCTTCAT CTTCTTAGCA TTCCATGTTT CCCTGTGGAA CTCTTCTTTA AAGCTTATTT 541 AAGAATTCTA GCCGAGCAGG GTGGCTCATG TCTGTAATCC CAGCACTTTG GGAGGCTGAG 601 GTGAGAGGAT TGCTTGAGCC CAAGAGTTCG AGACCAACCT GGGCAACATA GTGAGACTTG 661 ATCTCTACAA AAAAATATTT AAAACATTTC CAGGCATGGT GGCATGTGCC TGTAGTCCCA 721 GCTATTCGAG AGGCTGAGAT AGGAGAATCA CTTGAGCCTG AAAGGTTGAG GCTACAGTGA 781 GCCATGATTG CACCACTGCA CTTCAGCCTG GGTGACAGAA TGAGATCCTG TCTCAAAAAA 841 AGAAAAAAAT TATAAATACA TAGTAGTATA TAGCTGGGTG TGGTGATGCA AGCCTGTAAT 901 TCCAGTTACT CAGGAGACTG AGGCAAGAGA ATTGCTTGAA CCCGGGAGTG GAGGTTGCAG 961 TGAGCTGAAA TCGTGCCACT GCTCTCCCCA ACCTGGGCGA C
Analysis of the Whole Chromosome
Now that you can easily access the whole chromosome, you can analyze the data. This example shows one way to look at the GC content along the chromosome.
You extract blocks of 500000bp and calculate the GC content.
Calculate how many blocks to use
numNT = numel(chr1.Data); blockSize = 500000; numBlocks = floor(numNT/blockSize);
One way to help MATLAB performance when working with large data sets is to "preallocate" space for data. This allows MATLAB to allocate enough space for all of the data rather than having to grow the array in small chunks. This will speed things up and also protect you from problems of the data getting too large to store. For more details on pre-allocating arrays, see: http://www.mathworks.com/support/solutions/data/1-18150.html?solution=1-18150
An easy way to preallocate an array is to use the zeros function.
ratio = zeros(numBlocks+1,1);
Note that some parts of the sequence are unknown or ambiguous, as indicated by the N symbol. These are ignored in the calculation but there are some 500000 bp regions that are made up entirely of Ns. These will give warnings about dividing by zero. So you can tell MATLAB to ignore these warnings.
ws = warning('off','MATLAB:divideByZero');
Loop over the data looking for C or G and then divide this number by the total number of A, T, C, and G. This will take about a minute to run.
A = nt2int('A'); C = nt2int('C'); G = nt2int('G'); T = nt2int('T'); for count = 1:numBlocks % calculate the indices for the block start = 1 + blockSize*(count-1); stop = blockSize*count; % extract the block block = chr1.Data(start:stop); % find the GC and AT content gc = (sum(block == G | block == C)); at = (sum(block == A | block == T)); % calculate the ratio of GC to the total known nucleotides ratio(count) = gc/(gc+at); end
The final block is smaller so treat this as a special case.
block = chr1.Data(stop+1:end); gc = (sum(block == G | block == C)); at = (sum(block == A | block == T)); ratio(end) = gc/(gc+at);
Reset the warning state to turn divide by zero warnings back on.
warning(ws);
Plot of the GC Content for the Homo Sapiens Chromosome 1
xAxis = [1:blockSize:numBlocks*blockSize, numNT]; plot(xAxis,ratio) xlabel('Base pairs'); ylabel('Relative GC content'); title('Relative GC content of Homo Sapiens Chromosome 1')
The region in the center of the plot around 140Mbp is a large region of Ns
seqdisp(chr1.Data(140000000:140001000))
ans = 1 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 61 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 121 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 181 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 241 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 301 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 361 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 421 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 481 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 541 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 601 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 661 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 721 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 781 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 841 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 901 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN 961 NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN N
Finding Regions of High GC Content
You can use find to identify regions of high GC content.
indices = find(ratio>0.5);
ranges = [(1 + blockSize*(indices-1)), blockSize*indices];
fprintf('Region %d:%d has GC content %f\n',[ranges ,ratio(indices)]')
Region 1000001:1500000 has GC content 0.606918 Region 1500001:2000000 has GC content 0.538524 Region 2000001:2500000 has GC content 0.590432 Region 2500001:3000000 has GC content 0.569760 Region 3000001:3500000 has GC content 0.584924 Region 3500001:4000000 has GC content 0.543873 Region 6000001:6500000 has GC content 0.556684 Region 9000001:9500000 has GC content 0.508438 Region 10500001:11000000 has GC content 0.516098 Region 11500001:12000000 has GC content 0.518680 Region 16000001:16500000 has GC content 0.512598 Region 17000001:17500000 has GC content 0.507808 Region 17500001:18000000 has GC content 0.509944 Region 18500001:19000000 has GC content 0.500580 Region 21500001:22000000 has GC content 0.503450 Region 22500001:23000000 has GC content 0.501886 Region 151500001:152000000 has GC content 0.513984 Region 153000001:153500000 has GC content 0.506792 Region 224500001:225000000 has GC content 0.523292
If you want to remove the temporary file, you must first clear the memmapfile object.
clear chr1
delete(mmFilename)