How to use memmapfile for a very large structured binary file

Hello:
I need to process a 62 GB structured binary file written from a 24 h simulation. The structure of the file is as follows:
ft(1).length = 1; ft(1).type = 'integer*4'; ft(1).name = 'VehID';
ft(2).length = 1; ft(2).type = 'real*4'; ft(2).name = 'Time';
ft(3).length = 1; ft(3).type = 'integer*4'; ft(3).name = 'Longitude';
ft(4).length = 1; ft(4).type = 'integer*4'; ft(4).name = 'Latitude';
ft(5).length = 1; ft(5).type = 'integer*2'; ft(5).name = 'Heading';
ft(6).length = 1; ft(6).type = 'integer*4'; ft(6).name = 'Segment';
ft(7).length = 1; ft(7).type = 'integer*2'; ft(7).name = 'Dir';
ft(8).length = 1; ft(8).type = 'integer*4'; ft(8).name = 'Lane';
ft(9).length = 1; ft(9).type = 'real*4'; ft(9).name = 'Offset';
ft(10).length = 1; ft(10).type = 'real*4'; ft(10).name = 'Distance';
ft(11).length = 1; ft(11).type = 'real*4'; ft(11).name = 'Speed';
ft(12).length = 1; ft(12).type = 'real*4'; ft(12).name = 'Acceleration';
I am able to read this file using readfields with the format above, but it is taking forever to go through its 1,506,979,651 records. I would like to partition this file into 96 files based on the value of 'Time', which covers 24 hours (15 min increments -> 96 files), and keep only VehID, Time, Distance, Speed, and Acceleration. After extensive reading (I am still learning MATLAB), I understand that memmapfile would be a good way to go, but I am unable to make that command work. I need help writing the appropriate memmapfile statement (especially the format) so that I can process this file efficiently.
Thank you for your help,
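As a quick sanity check on the layout above, the record size implied by the field list can be compared with the size of the file. This is only a minimal sketch: it assumes the fields are packed back to back, with no padding and no Fortran record markers, which the question does not state.

```matlab
% Field types copied from the question, in file order.
types = {'integer*4','real*4','integer*4','integer*4','integer*2', ...
         'integer*4','integer*2','integer*4','real*4','real*4', ...
         'real*4','real*4'};

% Size in bytes of each Fortran-style type name.
sizes = containers.Map({'integer*4','real*4','integer*2'}, {4, 4, 2});

recordBytes = sum(cellfun(@(t) sizes(t), types));  % bytes per record
nRecords    = 1506979651;                          % record count quoted in the question
totalBytes  = recordBytes * nRecords;              % below 2^53, so exact in a double

fprintf('%d bytes per record, %.2f GiB in total\n', recordBytes, totalBytes / 2^30);
```

Under that packing assumption each record is 44 bytes, and 44 bytes times the quoted record count comes to roughly 61.75 GiB, which agrees with the 62 GB mentioned in the question.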
JDS
  1 Comment
per isakson on 1 Dec 2014
Edited: per isakson on 1 Dec 2014
The free (as in beer) program GSplit might be an alternative to split the file. I was once able to use it successfully minutes after downloading.


Accepted Answer

per isakson on 1 Dec 2014
Edited: per isakson on 19 May 2015
This gave me a chance to try a complicated format. Result:
filespec = 'usgsdems.dat'; % A sample file I found in the Map Toolbox
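n_repeat = 24*60/15; % = 96, used below as the number of records to map
nday = 1;
% Byte offset into the file; the field sizes below add up to 44 bytes per record
N = (nday-1) * n_repeat * sum([ 4, 4, 4, 4, 2, 4, 2, 4, 4, 4, 4, 4 ]);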
%
mmp = memmapfile( filespec ...
, 'Offset' , N ...
, 'Format', {
'int32' , [1,1], 'VehID'
'single', [1,1], 'Time'
'int32' , [1,1], 'Longitude'
'int32' , [1,1], 'Latitude'
'int16' , [1,1], 'Heading'
'int32' , [1,1], 'Segment'
'int16' , [1,1], 'Dir'
'int32' , [1,1], 'Lane'
'single', [1,1], 'Offset'
'single', [1,1], 'Distance'
'single', [1,1], 'Speed'
'single', [1,1], 'Acceleration'
} ...
, 'Repeat', n_repeat );
>> mmp.Data(1).VehID
ans =
1701994860 % garbage but indicates the syntax is correct
>> mmp.Data(2).VehID
ans =
538976313
>> mmp.Data(n_repeat).VehID
ans =
538976288
However,
>> mmp.Data(2:3).VehID
Error using memmapfile/subsref (line 782)
A subscripting operation on the Data field attempted to create a comma-
separated list. The memmapfile class does not support the use of comma-
separated lists when subscripting.
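One possible way around this error is to copy the slice of records into an ordinary struct array first and collect the field from that copy. The sketch below is self-contained and uses a tiny two-field demo file rather than the real format; the file and the variable names `blk` and `ids` are made up for illustration.

```matlab
% Build a tiny demo file: 4 records of (int32 VehID, single Time).
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
for k = 1:4
    fwrite(fid, int32(100 + k), 'int32');
    fwrite(fid, single(k * 0.5), 'single');
end
fclose(fid);

mmp = memmapfile(fname, 'Format', {'int32',  [1 1], 'VehID'
                                   'single', [1 1], 'Time'});

% Copy records 2 and 3 into a plain struct array ...
blk = mmp.Data(2:3);
% ... and expand the comma-separated list on that copy, not on the
% memmapfile object itself.
ids = [blk.VehID];

clear mmp        % release the mapping before deleting the file
delete(fname);
```

Here `ids` should hold the VehID values of records 2 and 3. For a very large file the slice should stay small enough to fit comfortably in memory.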
"and keep only VehID, Time, Distance, Speed, and Acceleration"
AFAIK, the new files must be written record by record, including only the fields that are to be kept.
I'm not convinced the process will be fast.
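If writing one record at a time turns out to be too slow, a chunked variant might be worth trying. The sketch below uses a different technique from the struct format above: it maps each chunk as a 44-by-n `uint8` matrix, decodes only Time with `typecast`, and copies the raw bytes of the five kept fields to one output file per 15-minute bin. Several things here are assumptions and not taken from the thread: that Time is in seconds from 0 to 86400, that the file uses the machine's native byte order, that records are packed at 44 bytes, and the output names `part_NN.dat`. The input is a small synthetic file so the block can run on its own.

```matlab
% --- Synthetic input: 1000 packed 44-byte records (assumed layout) ---
recBytes = 44;
nDemo    = 1000;
demoTime = single(sort(rand(1, nDemo)) * 86399);    % assumed unit: seconds
raw      = zeros(recBytes, nDemo, 'uint8');
raw(1:4, :) = reshape(typecast(int32(1:nDemo), 'uint8'), 4, []);  % VehID bytes
raw(5:8, :) = reshape(typecast(demoTime, 'uint8'), 4, []);        % Time bytes
inFile = [tempname '.dat'];
fid = fopen(inFile, 'w'); fwrite(fid, raw, 'uint8'); fclose(fid);

% --- One output file per 15-minute bin (hypothetical names) ---
outDir = tempname; mkdir(outDir);
nBins  = 96;
binSec = 900;
fids   = zeros(1, nBins);
for b = 1:nBins
    fids(b) = fopen(fullfile(outDir, sprintf('part_%02d.dat', b)), 'w');
end

% --- Chunked pass over the input ---
info     = dir(inFile);
nRec     = info.bytes / recBytes;
chunk    = 256;              % records per chunk; far larger for a real file
keepRows = [1:8, 33:44];     % bytes of VehID, Time, Distance, Speed, Acceleration
for first = 1:chunk:nRec
    n = min(chunk, nRec - first + 1);
    m = memmapfile(inFile, 'Offset', (first - 1) * recBytes, ...
                   'Format', {'uint8', [recBytes n], 'raw'}, 'Repeat', 1);
    blk  = m.Data.raw;                                  % recBytes-by-n copy
    t    = typecast(reshape(blk(5:8, :), 1, []), 'single');
    bin  = min(max(floor(double(t) / binSec) + 1, 1), nBins);
    keep = blk(keepRows, :);                            % 20 bytes per record
    for b = unique(bin)
        fwrite(fids(b), keep(:, bin == b), 'uint8');    % columns = whole records
    end
    clear m
end
for b = 1:nBins
    fclose(fids(b));
end
delete(inFile);
```

Each output record is 20 bytes (int32 VehID followed by four singles), so the partitioned files can later be mapped with a five-field format. Whether this is actually faster than a record-by-record loop on the 62 GB file has not been measured here.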

More Answers (1)

Jean-Daniel Saphores on 2 Dec 2014
Thank you. This format works, and this code is much faster than what I was doing before. Since I am not using the variables Longitude through Offset, I combined them so that I could drop them from my structure array. Your help is much appreciated.
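The post does not show the final format, but one way to read "combined" is the following: the seven unused fields (Longitude through Offset) occupy 4+4+2+4+2+4+4 = 24 bytes, so they can be mapped as a single filler field. The field name `Unused` and its `uint8` type are assumptions made for this sketch, and the demo file is a throwaway.

```matlab
% Format with the 24 unused bytes (Longitude ... Offset) merged into one field.
fmt = { 'int32' , [1  1], 'VehID'
        'single', [1  1], 'Time'
        'uint8' , [1 24], 'Unused'
        'single', [1  1], 'Distance'
        'single', [1  1], 'Speed'
        'single', [1  1], 'Acceleration' };

% Demo on a temporary file holding three zero-filled 44-byte records.
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fwrite(fid, zeros(44, 3, 'uint8'), 'uint8');
fclose(fid);

mmp  = memmapfile(fname, 'Format', fmt);
nRec = numel(mmp.Data);   % number of whole records found in the file

clear mmp                 % release the mapping before deleting the file
delete(fname);
```

Because the filler keeps the record at 44 bytes, the kept fields stay at the same byte offsets as in the twelve-field format from the accepted answer.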
