Thread Subject:
Speed up large file reading

From: Benjamin Kraus

Date: 7 Jun, 2012 02:08:07

Message: 1 of 1

I'm trying to speed up two pure m-file functions that I wrote for reading pieces of data from a binary data file. The data files range in size from about 800MB to as large as 2GB+.

I have the functions working, but they are horribly slow. I think this is mostly due to the way the file is organized, which forces many separate calls to fread rather than one call that reads a large chunk of the file. The majority of the file (and hence of the reading time) consists of millions of "data blocks". Each data block has a fixed-size header followed by a variable amount of data. The data block header has the following definition in C:

struct DataBlockHeader
{
    short Type; // Data type; 1=spike, 4=Event, 5=continuous
    unsigned short MSBTimestamp; // Upper 8 bits of the 40 bit timestamp
    unsigned long LSBTimeStamp; // Lower 32 bits of the 40 bit timestamp
    short Channel;
    short Unit;
    short NumberOfWaveforms;
    short NumberOfWordsInWaveform;
}; // 16 bytes

The product of "NumberOfWaveforms" and "NumberOfWordsInWaveform" gives the size of the data following the header, counted in 16-bit words (the same units my code reads with '*short').
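For illustration, here is a minimal sketch of how the 16-byte header maps onto the eight 16-bit words that fread(fid, 8, '*short') returns. It assumes little-endian byte order (so the low half of LSBTimeStamp comes first), which the post does not state, and the example values are made up.

```matlab
% Eight header words as fread(fid, 8, '*short') would return them.
% The values are invented for this example.
w = int16([1; 2; 1000; 3; 17; 2; 1; 32]);

% Reinterpret the same bits as unsigned so the timestamp halves are not negative.
u = double(typecast(w, 'uint16'));

blockType = double(w(1));        % 1=spike, 4=event, 5=continuous
channel   = double(w(5));
unit      = double(w(6));

% ASSUMPTION: little-endian file, so word 3 is the low half of LSBTimeStamp
% and word 4 is the high half. Word 2 carries the upper 8 bits of the
% 40-bit timestamp. A double holds a 40-bit integer exactly.
timestamp = u(2) * 2^32 + u(4) * 2^16 + u(3);

% Number of 16-bit words that follow this header.
dataWords = double(w(7)) * double(w(8));

fprintf('type %d, channel %d, unit %d, timestamp %.0f, data words %d\n', ...
        blockType, channel, unit, timestamp, dataWords);
```

Converting to double before multiplying matters because arithmetic on int16 values saturates at 32767 in MATLAB.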

I'm writing two different functions. The first is supposed to simply count the number of data blocks of each 'Type', 'Channel', and 'Unit' (this information is *not* available in the file header, nor is the order of the data predictable in any way). The second function is to extract all the data that has a specific 'Type', and has a 'Channel' and 'Unit' within a supplied list.
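As a rough sketch of the selection step in the second function: once the 'Type', 'Channel', and 'Unit' of each block are known, the blocks to extract can be picked with a logical mask. The arrays and the names wantType, wantChannels, and wantUnits are hypothetical, and the sketch assumes the channel and unit lists are applied independently rather than as (channel, unit) pairs.

```matlab
% Hypothetical header fields collected for six blocks (made-up values).
types    = [1 1 4 5 1 5];
channels = [3 7 2 1 3 9];
units    = [1 2 0 0 4 0];

% Hypothetical selection criteria supplied by the caller.
wantType     = 1;          % spike blocks only
wantChannels = [3 9];
wantUnits    = [1 2 4];

% A block is kept when its type matches and both its channel and its unit
% appear in the supplied lists (lists treated independently here).
keep = (types == wantType) ...
     & ismember(channels, wantChannels) ...
     & ismember(units, wantUnits);

idx = find(keep);          % indices of the blocks whose data should be read
disp(idx);
```

If the lists are meant as (channel, unit) pairs instead, ismember with the 'rows' option on a two-column matrix would be the analogous test.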

Because of the variable-size data blocks and the different meanings of 'Channel' and 'Unit' depending on the 'Type', my implementation of the first function looks something like this:

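count1 = zeros(96,26); % Type 1 counts by (Channel, Unit); sizes are determined from file header
count4 = zeros(255,1); % Type 4 counts by Channel
count5 = zeros(64,1);  % Type 5 counts by Channel
dsz = 0; % Data block size, in shorts, preceding the header held in dh
dh = fread(fid, 8, '*short'); % First header: 8 shorts = 16 bytes
while ~feof(fid)
  % dh holds dsz shorts of data followed by the next 8-short header
  if dh(dsz+1)==1
    count1(dh(dsz+5),dh(dsz+6)) = count1(dh(dsz+5),dh(dsz+6))+1;
  elseif dh(dsz+1)==4
    count4(dh(dsz+5)) = count4(dh(dsz+5))+1;
  elseif dh(dsz+1)==5
    count5(dh(dsz+5)) = count5(dh(dsz+5))+1;
  end
  % Convert to double first: an int16 product saturates at 32767
  dsz = double(dh(dsz+7))*double(dh(dsz+8));
  dh = fread(fid, dsz+8, '*short'); % This block's data plus the next header
end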

The company that created this format (and wrote the software producing the data) has a (Windows only) software utility that can do this on the order of seconds. They also distribute a (closed source, Windows only, and extremely buggy) MEX library that can also do this on the order of seconds (it crashed MATLAB three times while I was trying to take that measurement). My function takes on the order of minutes.
For example, on one unusually small file I've been using for testing (141MB), their MEX library took about 2 seconds, their standalone GUI client took about 2 seconds, and my function took about 60 seconds.

Can anybody think of ways to improve the execution time of these functions so that my function runs within the same order of magnitude as the closed source versions? I'm trying to avoid a MEX file implementation (I want this to be readily cross-platform, my C is much rustier than my MATLAB, and I want to avoid some of the bugginess of the closed source version), but I'll go that route if necessary.

- Ben
