Using Memory Mapped Files to Work With Whole Genome Data

Whole genomes are available for humans, mouse, rat, fugu, and several other model organisms. For many of these organisms one chromosome can be several hundred million base pairs long. Working with such large data sets can be challenging as you may run into limitations of the hardware and software that you are using. This example shows one way to work around these limitations in MATLAB.

Contents

Large Data Set Handling Issues

Solving technical computing problems that require processing and analyzing large amounts of data puts a high demand on your computer system. Large data sets take up significant memory during processing and can require many operations to compute a solution. It can also take a long time to access information from large data files.

Computer systems, however, have limited memory and finite CPU speed. Available resources vary by processor and operating system, the latter of which also consumes resources. For example:

32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 = 4 GB of memory (also known as virtual address space). Windows XP and Windows 2000 allocate only 2 GB of this virtual memory to each process (such as MATLAB). On UNIX, the virtual memory allocated to a process is system-configurable and is typically around 3 GB. The application carrying out the calculation, such as MATLAB, can require storage in addition to the user task. The main problem when handling large amounts of data is that the memory requirements of the program can exceed that available on the platform. For example, MATLAB generates an "out of memory" error when data requirements exceed approximately 1.7 GB on Windows XP.

More details on memory management and large data sets can be found in this MATLAB Digest article or in the Memory Management Guide in our support site.

On a typical 32-bit machine, the maximum size of a single data set that you can work with in MATLAB is a few hundred MB, or about the size of a large chromosome. Memory mapping of files allows MATLAB to work around this limitation and enables you to work with very large data sets in an intuitive way.

Whole Genome Data Sets

The latest whole genome data sets can be downloaded from the Ensembl Web site. The data are provided in several formats. These are updated regularly as new sequence information becomes available. This example will use human DNA data stored in FASTA format. Chromosome 1 is (in the NCBI35 December 2005 release) a 64.5 MB compressed file. After uncompressing the file it is about 250MB. MATLAB uses 2 bytes per character, so if you read the file into MATLAB, it will require about 500MB of memory.

This example assumes that you have already downloaded and uncompressed the FASTA file into your local directory, change the name of the variable FASTAfilename if appropriate.

FASTAfilename = 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa';
fileInfo = dir(FASTAfilename)
fileInfo = 
     name: 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa'
     date: '15-Nov-2005 09:48:56'
    bytes: 249614947
    isdir: 0

Memory Mapped Files

Memory mapping allows MATLAB to access data in a file as though it is in memory. You can use standard MATLAB indexing operations to access data. See the documentation for memmapfile for more details.

You could just map the FASTA file and access the data directly from there. However the FASTA format file includes new line characters. The memmapfile function treats these characters in the same way as all other characters. Removing these before memory mapping the file will make indexing operations simpler. Also, memory mapping does not work directly with character data so you will have to treat the data as 8-bit integers (uint8 class). The function nt2int in the Bioinformatics Toolbox can be used to convert character information into integer values. int2nt is used to convert back to characters.

First open the FASTA file and extract the header.

fidIn = fopen(FASTAfilename,'r');
header = fgetl(fidIn)
header =
>1 dna:chromosome chromosome:NCBI35:1:1:245522847:1

Open the file to be memory mapped

[fullPath, filename, extension] = fileparts(FASTAfilename);
mmFilename = [filename '.mm']
fidOut = fopen(mmFilename,'w');
mmFilename =
Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm

Read the FASTA file in blocks of 1MB, remove new line characters, convert to uint8, and write to the MM file.

newLine = sprintf('\n');
blockSize = 2^20;
while ~feof(fidIn)
    % Read in the data
    charData = fread(fidIn,blockSize,'*char')';
    % Remove new lines
    charData = strrep(charData,newLine,'');
    % Convert to integers
    intData = nt2int(charData);
    % Write to the new file
    fwrite(fidOut,intData,'uint8');
end

Close the files.

fclose(fidIn);
fclose(fidOut);

The new file is about the same size as the old file but does not contain new lines or the header information.

mmfileInfo = dir(mmFilename)
mmfileInfo = 
     name: 'Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm'
     date: '19-Jan-2006 11:57:59'
    bytes: 245522847
    isdir: 0

Accessing the Data in the Memory Mapped File

The command memmapfile constructs a memmapfile object that maps the new file to memory. In order to do this, it needs to know the format of the file. The format of this file is simple, though much more complicated formats can be mapped.

chr1 = memmapfile(mmFilename, 'format', 'uint8')
chr1 =
    Filename: 'C:\matlab\toolbox\bioinfo\biodemos\Homo_sapiens.NCBI35.dec.dna.chromosome.1.mm'
    Writable: false
      Offset: 0
      Format: 'uint8'
      Repeat: Inf
        Data: 245522847x1 uint8 array

The MEMMAPFILE Object

The memmapfile object has various properties. Filename stores the full path to the file. Writable indicates whether or not the data can be modified. Note that if you do modify the data, this will also modify the original file. Offset allows you to specify the space used by any header information. Format indicates the data format. Repeat is used to specify how many blocks (as defined by Format) to map. This can be useful for limiting how much memory is used to create the memory map. These properties can be accessed in the same way as other MATLAB data. For more details see type help memmapfile or doc memmapfile.

chr1.Data(1:10)
ans =
    4
    1
    1
    2
    2
    2
    4
    1
    1
    2

You can access any region of the data using indexing operations.

chr1.Data(10000000:10000010)'
ans =
    3    2    2    1    2    1    3    4    2    4    2

Remember that the nucleotide information was converted to integers. You can use int2nt to get the sequence information back.

int2nt(chr1.Data(10000000:10000010)')
ans =
GCCACAGTCTC

Or use seqdisp to display the sequence.

seqdisp(chr1.Data(10000000:10001000)')
ans =
   1  GCCACAGTCT CCCGAGTAGC TGGGATTACA GGGGTGCACC ACCATGCCCA GCTAATTTTT
  61  GTTTTTCAGT AGAGATGGGG TTTTGCCATG TTGCCCAGGC TGGTTTTGAA CTCCTGACCT
 121  CAGGTGATCC ACCTGCCTCT GCCTCCCAAA GTGCTGGGAT TACAGGCATG AGCCACCACG
 181  GCTGGCAGAA TCACGTCTCA TCTCTAACTC CTCTCCTTCC TCCTCCCTCT CCCCGATTCT
 241  GCGGCAGATA CACTAGGCTC CTCAGAGTTC CCGAAACACA CTGGCACACC CTTCCTCAGA
 301  GTCCCAGGGC TCCCTGGATT TCTGCCTTAA ACTATTTTCC CAGAAATCTG TGTGGCTTGA
 361  TTCTTTACTT CTTTCAGCTC CCCGCTGAGA CGTCACCTGG CCATTATTTA AAATAGTGGT
 421  GTTTATTTCT AGCTACATAA TATGCTCAAA GGCAGTAAGT GGAACTGGGA TTCCAAACCC
 481  TGATCTTCAT CTTCTTAGCA TTCCATGTTT CCCTGTGGAA CTCTTCTTTA AAGCTTATTT
 541  AAGAATTCTA GCCGAGCAGG GTGGCTCATG TCTGTAATCC CAGCACTTTG GGAGGCTGAG
 601  GTGAGAGGAT TGCTTGAGCC CAAGAGTTCG AGACCAACCT GGGCAACATA GTGAGACTTG
 661  ATCTCTACAA AAAAATATTT AAAACATTTC CAGGCATGGT GGCATGTGCC TGTAGTCCCA
 721  GCTATTCGAG AGGCTGAGAT AGGAGAATCA CTTGAGCCTG AAAGGTTGAG GCTACAGTGA
 781  GCCATGATTG CACCACTGCA CTTCAGCCTG GGTGACAGAA TGAGATCCTG TCTCAAAAAA
 841  AGAAAAAAAT TATAAATACA TAGTAGTATA TAGCTGGGTG TGGTGATGCA AGCCTGTAAT
 901  TCCAGTTACT CAGGAGACTG AGGCAAGAGA ATTGCTTGAA CCCGGGAGTG GAGGTTGCAG
 961  TGAGCTGAAA TCGTGCCACT GCTCTCCCCA ACCTGGGCGA C                    

Analysis of the Whole Chromosome

Now that you can easily access the whole chromosome, you can analyze the data. This example shows one way to look at the GC content along the chromosome.

You extract blocks of 500000bp and calculate the GC content.

Calculate how many blocks to use

numNT = numel(chr1.Data);
blockSize = 500000;
numBlocks = floor(numNT/blockSize);

One way to help MATLAB performance when working with large data sets is to "preallocate" space for data. This allows MATLAB to allocate enough space for all of the data rather than having to grow the array in small chunks. This will speed things up and also protect you from problems of the data getting too large to store. For more details on pre-allocating arrays, see: http://www.mathworks.com/support/solutions/data/1-18150.html?solution=1-18150

An easy way to preallocate an array is to use the zeros function.

ratio = zeros(numBlocks+1,1);

Note that some parts of the sequence are unknown or ambiguous, as indicated by the N symbol. These are ignored in the calculation but there are some 500000 bp regions that are made up entirely of Ns. These will give warnings about dividing by zero. So you can tell MATLAB to ignore these warnings.

ws = warning('off','MATLAB:divideByZero');

Loop over the data looking for C or G and then divide this number by the total number of A, T, C, and G. This will take about a minute to run.

A = nt2int('A'); C = nt2int('C'); G = nt2int('G'); T = nt2int('T');

for count = 1:numBlocks
    % calculate the indices for the block
    start = 1 + blockSize*(count-1);
    stop = blockSize*count;
    % extract the block
    block = chr1.Data(start:stop);
    % find the GC and AT content
    gc = (sum(block == G | block == C));
    at = (sum(block == A | block == T));
    % calculate the ratio of GC to the total known nucleotides
    ratio(count) = gc/(gc+at);
end

The final block is smaller so treat this as a special case.

block = chr1.Data(stop+1:end);
gc = (sum(block == G | block == C));
at = (sum(block == A | block == T));
ratio(end) = gc/(gc+at);

Reset the warning state to turn divide by zero warnings back on.

warning(ws);

Plot of the GC Content for the Homo Sapiens Chromosome 1

xAxis = [1:blockSize:numBlocks*blockSize, numNT];
plot(xAxis,ratio)
xlabel('Base pairs');
ylabel('Relative GC content');
title('Relative GC content of Homo Sapiens Chromosome 1')

The region in the center of the plot around 140Mbp is a large region of Ns

seqdisp(chr1.Data(140000000:140001000))
ans =
   1  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
  61  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 121  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 181  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 241  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 301  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 361  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 421  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 481  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 541  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 601  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 661  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 721  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 781  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 841  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 901  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN
 961  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN N                    

Finding Regions of High GC Content

You can use find to identify regions of high GC content.

indices = find(ratio>0.5);
ranges = [(1 + blockSize*(indices-1)), blockSize*indices];
fprintf('Region %d:%d has GC content %f\n',[ranges ,ratio(indices)]')
Region 1000001:1500000 has GC content 0.606918
Region 1500001:2000000 has GC content 0.538524
Region 2000001:2500000 has GC content 0.590432
Region 2500001:3000000 has GC content 0.569760
Region 3000001:3500000 has GC content 0.584924
Region 3500001:4000000 has GC content 0.543873
Region 6000001:6500000 has GC content 0.556684
Region 9000001:9500000 has GC content 0.508438
Region 10500001:11000000 has GC content 0.516098
Region 11500001:12000000 has GC content 0.518680
Region 16000001:16500000 has GC content 0.512598
Region 17000001:17500000 has GC content 0.507808
Region 17500001:18000000 has GC content 0.509944
Region 18500001:19000000 has GC content 0.500580
Region 21500001:22000000 has GC content 0.503450
Region 22500001:23000000 has GC content 0.501886
Region 151500001:152000000 has GC content 0.513984
Region 153000001:153500000 has GC content 0.506792
Region 224500001:225000000 has GC content 0.523292

If you want to remove the temporary file, you must first clear the memmapfile object.

clear chr1
delete(mmFilename)

Provide feedback on this example.