Loading large binary files in Matlab, quickly

I have some pretty massive data files (256 channels, on the order of 75-100 million samples per channel) in int16 format. They are written in flat binary format with the channels interleaved, so the structure is something like: CH1S1,CH2S1,CH3S1 ... CH256S1,CH1S2,CH2S2,...
I need to read in each channel separately, filter and offset correct it, then save. My current bottleneck is loading each channel, which takes about 7-8 minutes... scale that up 256 times, and I'm looking at nearly 30 hours just to load the data! I am trying to intelligently use fread, to skip bytes as I read each channel; I have the following code in a loop over all 256 channels to do this:
offset = i - 1;
fseek(fid,offset*2,'bof');
dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);
Reading around, this is typically the fastest way to load parts of a large binary file, but is the file simply too large to do this any faster? Any suggestions would be much appreciated!
System details: MATLAB 2017a, Windows 7, 64bit

4 Comments

How fast is this?
tic
fread(fid, [256 Inf], '*int16')
toc
Test it on a smaller data set first. Do you have 256 channels x 100 million samples each, i.e. 256 x 2 bytes x 100E6 = 51 GB? If so, loading it all at once will require a lot of RAM... If you instead have 100 million samples in total (0.2 GB), then it should be fast to load.
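As a rough sanity check on those numbers, here is a small sketch of the size arithmetic. The sample count of 100 million per channel is just the upper figure quoted in the question, not a measured value.

```matlab
% Size arithmetic for an interleaved int16 recording.
% nSampsPerChan is an assumed value taken from the question's upper estimate.
nChan          = 256;     % interleaved channels
bytesPerSample = 2;       % int16
nSampsPerChan  = 100e6;   % assumed: 100 million samples per channel

totalBytes   = nChan * bytesPerSample * nSampsPerChan;  % whole file
perChanBytes = bytesPerSample * nSampsPerChan;          % one channel only

fprintf('Whole file:  %.1f GB\n', totalBytes / 1e9);
fprintf('One channel: %.1f MB\n', perChanBytes / 1e6);
```

So the whole file is far larger than typical RAM, while a single channel on its own is only a few hundred MB.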
Josh
Josh on 20 Aug 2018
Edited: Josh on 20 Aug 2018
Yes, the files are on that order of size (the one I'm currently testing is 37GB). I have tested this code on smaller recordings, and loading the data still takes some time (~15 seconds for a file with around 2 million samples per channel).
Because I don't have that much RAM, I'm currently loading only one channel at a time, performing the relevant processing on it, then saving it out to its own file.
I just ran the code you sent to load the entire file... it ran for a while and returned an out of memory error, which makes sense, as I don't have 37GB RAM!
How much RAM do you actually have? It sounds like the performance hit is probably from data being swapped in and out of virtual memory; fread is pretty quick for straight data transfer to/from memory.
Is the processing required dependent upon having the whole timeseries in memory or can you do it piecewise on each channel?
You may just have a system limitation here...
Josh
Josh on 20 Aug 2018
Edited: Josh on 20 Aug 2018
We have 32GB installed, but it is a shared computer, so the availability varies.
For the processing, I only need one channel at a time, but for the filtering and offset correction I'm doing it's necessary to have the entire timeseries per channel, to avoid filtering artifacts that might arise from splitting the timeseries.
I'm a bit confused about the RAM allocation, though. As I'm only trying to load a subset of the data (using the "skip" parameter in fread), it should definitely be doable from a RAM standpoint... (for the 37GB file I'm testing now, 1 channel out of the 256 should only be 149MB). Unless the "skip" option of fread allocates memory in a way that I don't know of?


 Accepted Answer

Seems like you have to use stream processing. Essentially, load N frames of data for all 256 channels, do the processing, save the frame, and repeat until done. Reading channel by channel while skipping 256 channels x 2 bytes between samples seems slow. Here are some examples of how to set that up.
The other option is to buy >64 GB RAM.
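A minimal sketch of that frame-by-frame loop, assuming a small stand-in file. The file name stream_test.dat, the chunk size, and the running per-channel mean used as "processing" are all placeholders, not the original poster's actual pipeline.

```matlab
% Build a small interleaved test file as a stand-in for the real recording.
nChan  = 256;
nSamps = 1000;
A = randi([0 255], nChan, nSamps, 'int16');
fid = fopen('stream_test.dat', 'w');
fwrite(fid, A, 'int16');
fclose(fid);

chunk = 128;                    % frames per read (assumed; tune to available RAM)
total = zeros(nChan, 1);        % running per-channel sum (placeholder processing)
nRead = 0;                      % frames consumed so far

fid = fopen('stream_test.dat', 'r');
while true
    % Each column is one frame holding all channels; the last block may be short.
    block = fread(fid, [nChan chunk], '*int16');
    if isempty(block)
        break;                  % nothing left to read
    end
    total = total + sum(double(block), 2);
    nRead = nRead + size(block, 2);
end
fclose(fid);

chanMean = total / nRead;       % per-channel mean over the whole file
```

Only one block of nChan x chunk samples is in memory at a time, so the RAM needed is set by the chunk size rather than by the file size.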

7 Comments

Josh
Josh on 20 Aug 2018
Edited: Josh on 20 Aug 2018
My initial instinct was to do something like this... load a chunk of temporal data over all the channels, process it, and then move to the next chunk. Unfortunately, my processing involves filtering, which will introduce discontinuities if performed on chunks of data in time.
Is there an easy way to implement this in this toolbox? Maybe using padded data frames to reduce edge artifacts? I had considered coding this up myself if I can't get the current implementation to work...
Do you require all the temporal data, or just N frames? To fix discontinuity, you need to load extra data. Often, people use a moving frame average. So if you want to average 10 frames, then you'd load something like 20 frames, take the average of the first 10 frames. Then load the next 10 frames, etc. Hard to show it here conceptually:
----------            %load 10 frames, but take average of first 5
     ----------       %load next 5 frames, take average of previous 5
          ----------  %load next 5 frames, take average of previous 5
***************       %average values filled in thus far. No discontinuity.
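For a causal filter there is also an alternative to overlapping frames: MATLAB's filter function can return its final state and accept it as the initial state of the next call. The sketch below assumes a 10-point moving-average FIR filter and a random test signal, neither of which comes from the thread. It does not cover zero-phase filtering (e.g. filtfilt), which would still need padded or overlapping chunks.

```matlab
% Chunked causal filtering with state carried between chunks.
% The signal and filter coefficients are assumed for illustration only.
x = randn(1, 1000);
b = ones(1, 10) / 10;           % 10-point moving average
a = 1;

yFull = filter(b, a, x);        % reference: filter the whole series at once

chunk    = 128;                 % samples per chunk (assumed)
yChunked = zeros(size(x));
zi = zeros(1, max(numel(a), numel(b)) - 1);   % zero initial filter state
for s = 1:chunk:numel(x)
    e = min(s + chunk - 1, numel(x));
    % zi comes back as the final state and seeds the next chunk.
    [yChunked(s:e), zi] = filter(b, a, x(s:e), zi);
end

maxErr = max(abs(yFull - yChunked));
fprintf('Max difference between full and chunked output: %g\n', maxErr);
```

Because the state is passed along, the chunk boundaries do not introduce discontinuities for this kind of filter.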
Yes, I'm familiar with the concept, but was hoping to avoid it, as I thought loading per channel was more efficient for filtering (because convolutions are really fast). I'm starting to think that's not the case...
Loading the file in chunks of frames, rather than channel by channel with skipping, will give you a ~10-fold increase in speed.
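%Write a small interleaved test file: 256 channels x 1000 samples
FID = fopen('test.dat', 'w');
A = randi([0 255], 256, 1000, 'int16');
fwrite(FID, A, 'int16');
fclose(FID);

FID = fopen('test.dat', 'r');

%Method 1: read one channel at a time using the skip argument
tic
Data1 = zeros(256, 1000, 'int16');
for c = 1:256
    fseek(FID, (c-1)*2, 'bof');
    Data1(c, :) = fread(FID, [1 1000], '*int16', 255*2);
end
toc %0.1769 s

%Method 2: read all channels for every frame in a single call
tic
fseek(FID, 0, 'bof');
Data2 = fread(FID, [256 1000], '*int16');
toc %0.0179 s
fclose(FID);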
How about splitting your files into 256 smaller files, processing them as individual channel files, and then joining them at the end?
%file splitter
nChan = 256;    %channels in the interleaved file
nSamps = 1000;  %samples per channel in test.dat
F = 100;        %frames to read per chunk
FID = fopen('test.dat', 'r');
FID_List = cell(nChan, 1);
for j = 1:nChan
    FID_List{j} = fopen(sprintf('CH_%d.dat', j), 'w');
end
for n = 1:ceil(nSamps/F)
    Data = fread(FID, [nChan F], '*int16'); %last chunk may hold fewer than F frames
    for k = 1:nChan
        fwrite(FID_List{k}, Data(k, :), 'int16');
    end
end
fclose all
%To see channel 1 data
FID = fopen('CH_1.dat', 'r');
B = fread(FID, '*int16');
fclose(FID);
Josh
Josh on 21 Aug 2018
Edited: Josh on 21 Aug 2018
I just tested your second code snippet with my data and it was MUCH faster. Just 8 minutes to read and write the 37GB data set. Thanks!
Nice! Glad it worked!
Hi Livio, to get an answer for your problem, please create a separate Question post instead of responding to this thread that is closed (answer is accepted).
Also, in your new Question post, format your code by selecting your code and pushing the {}Code button.
this is how to format code
for j = 1
end


Asked:

on 20 Aug 2018

Commented:

on 12 Sep 2018
