How to efficiently read nested complex data structures from binary files

Hi
I am trying to implement a reader for structured data generated with C++. The files contain all the "simple" data types (int, uint, float, etc.) plus structures. Some array lengths are "fixed"; for the others ("non-fixed") I first need to read the size of the array and then read the array itself. There are arrays of structures with both fixed and non-fixed lengths, there are nested structures, etc. I am trying two approaches, memmapfile and typecast.
The structures are not optimized to be read with memmapfile, in the sense that a structure may have a few "simple" datatype fields (int, float, etc.), then a struct, then more simple datatype fields, then another struct, then more simple datatype fields, and so on. An example is given below.
Things get even more complicated: the reader must handle dozens of different file types, so I first read a description of how the binary file is laid out (done), and then I read the binary data using that description.
I have a file containing first a "non-fixed" array of elements (type 1) and then a second "non-fixed" array of elements (type 2). First I need to read how many type 1 elements I have, read them, then read how many type 2 elements I have, and read them. The structure looks like:
  • Number N1 of element1 elements (ii index) : 1x1 uint32
  • element1(ii).field1 : 1x1 struct
  • element1(ii).field1.subfield1 : 1x1 int32
  • element1(ii).field1.subfield2 : 1x1 uint32
  • element1(ii).field1.subfield3 : 1x1 uint32
  • element1(ii).field2 : 1x1 double
  • element1(ii).field3 : 20x1 single
  • element1(ii).field4 : 5x1 uint64
  • element1(ii).field5 : 1x1 struct
  • element1(ii).field5.subfield1 : 1x1 uint8
  • element1(ii).field5.subfield2 : 1x1 single
  • element1(ii).field5.subfield3 : 1x1 uint64
  • Number N2 of element2 elements (ii index) : 1x1 uint32
  • element2(ii).field1 : 1x1 uint32
  • element2(ii).field2 : 1x1 single
  • element2(ii).field3 : 1x1 uint32 (number M1 of field4 subelements (jj index))
  • element2(ii).field4(jj)
  • element2(ii).field4(jj).subfield1 : 1x1 float
  • element2(ii).field4(jj).subfield2 : 1x1 uint32
  • element2(ii).field4(jj).subfield3 : 1x1 uint16
So far I detect runs of consecutive "simple datatype" variables and read each run in one go using memmapfile. For the example above, in pseudocode, I do:
  1. use memmapfile to read number N1
  2. for ii=1:N1
  3. use memmapfile to read the 3 subfields of field 1
  4. use memmapfile to read field2 to field 4
  5. use memmapfile to read the 3 subfields of field 5
  6. end
  7. use memmapfile to read number N2
  8. for ii=1:N2
  9. use memmapfile to read field1 to field 3 (field 3 is M1)
  10. for jj=1:M1
  11. use memmapfile to read subfields1 to 3 for field4
  12. end
  13. end
I know I could read all M1 elements at once in lines 10-12 instead of looping, but since structs can be nested further and/or contain arrays of non-fixed length, I chose this way.
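For the simple case in lines 10-12, where each field4 record has a fixed size, the loop can be replaced by one read plus a reshape. The following is only a sketch: it assumes the records are packed with no padding (4 + 4 + 2 = 10 bytes each) and that the file uses the machine's native byte order. The sample values and the temporary file are made up for illustration.

```matlab
% Write M1 = 3 packed field4 records with made-up values to a temp file.
M1   = 3;
sub1 = single([1.5; 2.5; 3.5]);
sub2 = uint32([10; 20; 30]);
sub3 = uint16([7; 8; 9]);
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
for jj = 1:M1
    fwrite(fid, sub1(jj), 'single');   % subfield1
    fwrite(fid, sub2(jj), 'uint32');   % subfield2
    fwrite(fid, sub3(jj), 'uint16');   % subfield3
end
fclose(fid);

% Read all M1 records as raw bytes in a single call.
recSize = 4 + 4 + 2;                   % assumed packed record size in bytes
fid = fopen(fname, 'r');
bytes = fread(fid, recSize*M1, '*uint8');
fclose(fid);
delete(fname);

% One column per record; each subfield occupies a fixed range of rows.
B = reshape(bytes, recSize, M1);
field4.subfield1 = typecast(reshape(B(1:4,  :), [], 1), 'single');
field4.subfield2 = typecast(reshape(B(5:8,  :), [], 1), 'uint32');
field4.subfield3 = typecast(reshape(B(9:10, :), [], 1), 'uint16');
```

The result is a struct of arrays (one M1-by-1 vector per subfield) rather than an M1-by-1 struct array, which is usually cheaper to build. If the file's byte order differs from the machine's, swapbytes can be applied to each decoded vector.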
A couple of general questions:
  1. Is memmapfile the optimal way to read this kind of structure?
  2. I thought typecast could improve the reading since the file is loaded entirely into memory, but it is way slower. Is there any other possible approach?
A couple of particular questions:
  1. In the first part of the code, is there a smart way to combine the reads in lines 3 to 5 into a single read? This case is "simple"; the subfields in field1 could themselves have nested fields, etc.
  2. Is there anything I can do to optimize this? I was thinking of generating one large flat structure to be read by memmapfile, with variable names like field1__subfield1, field1__subfield2, field1__subfield3, field2, field3, field4, field5__subfield1, field5__subfield2, field5__subfield3, so that I can read all N1 elements at the same time, but I am afraid of how complicated this could get for much more deeply nested structures.
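The flattened-name idea can be sketched with a single memmapfile Format table for element1. This is an illustration under assumptions, not a tested reader for the real files: it assumes the fields are packed with no padding (153 bytes per record) and native byte order, and it writes its own small sample file with made-up values.

```matlab
% Write a small sample file in the assumed packed element1 layout.
fname = [tempname '.bin'];
N1 = 2;
fid = fopen(fname, 'w');
fwrite(fid, N1, 'uint32');                 % number of element1 records
for ii = 1:N1
    fwrite(fid, -ii, 'int32');             % field1.subfield1
    fwrite(fid, [ii 2*ii], 'uint32');      % field1.subfield2, field1.subfield3
    fwrite(fid, ii/2, 'double');           % field2
    fwrite(fid, (1:20)*ii, 'single');      % field3
    fwrite(fid, (1:5)*ii, 'uint64');       % field4
    fwrite(fid, ii, 'uint8');              % field5.subfield1
    fwrite(fid, ii + 0.25, 'single');      % field5.subfield2
    fwrite(fid, 100*ii, 'uint64');         % field5.subfield3
end
fclose(fid);

% Map the count first, then all records with one flattened format.
hdr = memmapfile(fname, 'Format', 'uint32', 'Repeat', 1);
n1  = double(hdr.Data(1));
fmt = {
    'int32'   [1 1]   'field1__subfield1'
    'uint32'  [1 1]   'field1__subfield2'
    'uint32'  [1 1]   'field1__subfield3'
    'double'  [1 1]   'field2'
    'single'  [20 1]  'field3'
    'uint64'  [5 1]   'field4'
    'uint8'   [1 1]   'field5__subfield1'
    'single'  [1 1]   'field5__subfield2'
    'uint64'  [1 1]   'field5__subfield3'
    };
m   = memmapfile(fname, 'Offset', 4, 'Format', fmt, 'Repeat', n1);
rec = m.Data;                              % n1-by-1 struct array (copied out)

% Release the mappings before deleting the temporary file.
clear hdr m
delete(fname);
```

Because the Format table is just a cell array, it could be generated by walking the already-parsed file description recursively and joining the nested names with '__'. Deeper nesting then adds rows to the table but no extra reads. Arrays of non-fixed length still break the fixed record size, so they would need a separate mapping after the length is known.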
Thank you

Answers (1)

Dheeraj on 25 Sep 2023
Hi,
I understand that you are trying to read your data using memmapfile.
Reading complex binary data structures like the one you described can be challenging and may require a combination of approaches depending on the specific structure of the data and your performance requirements.
Addressing your questions below.
1. Is memmapfile the optimum way to read these kinds of structures?
memmapfile can be a good choice for reading binary data from files when you need to access specific parts of the data without loading the entire file into memory. In your case, since you have complex nested structures and arrays with variable lengths, memmapfile might not be the most efficient approach. Reading data field by field and element by element as you're currently doing can be suboptimal in terms of performance, especially when dealing with nested structures and dynamic array sizes.
2. Is there a smart way to integrate the reading of fields in lines 3 to 5 in a single read?
Yes, you can optimize the reading of these fields by reading them in a single operation using memmapfile. You can create a memmapfile object for the entire file and then use indexing to access the specific elements you need. Doing this reduces the number of file reads, which improves performance for large datasets.
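As a rough sketch of mapping the entire file and indexing into it, the example below maps a file as one uint8 vector and walks the element2 part of the question's layout with a byte cursor. The packed layout, the native byte order, the variable names (pos, recSize) and the sample values are all assumptions made for illustration.

```matlab
% Write a sample file: N2 = 2 element2 records with M1 = 2 and M1 = 1.
fname = [tempname '.bin'];
M = [2 1];
fid = fopen(fname, 'w');
fwrite(fid, numel(M), 'uint32');           % N2
for ii = 1:numel(M)
    fwrite(fid, 10 + ii, 'uint32');        % field1
    fwrite(fid, ii + 0.5, 'single');       % field2
    fwrite(fid, M(ii), 'uint32');          % field3 = M1
    for jj = 1:M(ii)
        fwrite(fid, jj/4, 'single');       % field4(jj).subfield1
        fwrite(fid, 100*jj, 'uint32');     % field4(jj).subfield2
        fwrite(fid, jj, 'uint16');         % field4(jj).subfield3
    end
end
fclose(fid);

% Map the whole file once as bytes and decode by index.
m = memmapfile(fname, 'Format', 'uint8');
recSize = 4 + 4 + 2;                       % assumed packed field4 record size
pos = 0;                                   % byte cursor (0-based)
N2 = double(typecast(m.Data(pos+1:pos+4), 'uint32'));
pos = pos + 4;
element2 = struct('field1', cell(N2, 1), 'field2', [], 'field3', [], 'field4', []);
for ii = 1:N2
    element2(ii).field1 = typecast(m.Data(pos+1:pos+4),  'uint32');
    element2(ii).field2 = typecast(m.Data(pos+5:pos+8),  'single');
    element2(ii).field3 = typecast(m.Data(pos+9:pos+12), 'uint32');
    pos = pos + 12;
    M1 = double(element2(ii).field3);
    % All M1 subelements in one indexed access, one column per record.
    B = reshape(m.Data(pos+1:pos+recSize*M1), recSize, M1);
    element2(ii).field4 = struct( ...
        'subfield1', typecast(reshape(B(1:4,  :), [], 1), 'single'), ...
        'subfield2', typecast(reshape(B(5:8,  :), [], 1), 'uint32'), ...
        'subfield3', typecast(reshape(B(9:10, :), [], 1), 'uint16'));
    pos = pos + recSize*M1;
end

% Release the mapping before deleting the temporary file.
clear m
delete(fname);
```

Here only the outer loop over element2 remains, because each record's length depends on its own M1. Whether this is faster than the per-field mappings would need to be measured on the real files.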
3. Optimizing for nested structures and variable-length arrays
Consider using serialization libraries like Protocol Buffers (protobuf) or Apache Avro for writing and reading complex binary data. These libraries provide a standardized way to define and serialize complex data structures, making it easier to handle nested structures and variable-length arrays.
You can also find "Fast serialize/deserialize" functions on MATLAB's File Exchange using the link below.
Hope this helps!

Release

R2018b
