Recommendation for Data Format for Long Term Data Storage

I'm currently developing a custom class to store various measured and analytical data sets in a common format. The idea is to create a standard format so that I can write a whole suite of tools to analyze, plot, modify, etc. whatever data sets I pass to it. Additionally, this format will have various pieces of metadata that I can populate to further describe the data and help with downstream analyses (some pieces of metadata will be values or string/character arrays, while others may be objects based on subclasses of this custom class). The hope is that this approach lets me decouple the input data format from the code and makes things more maintainable and flexible long term.
I'm struggling though with how I should store these data sets for future use. I've considered the following options so far (I've included some pros/cons I have thought of as well).
1. Import the data into the custom class and store the data set objects in a .mat file I can easily load later.
a. Pro: Don't need to reimport the data
b. Con: Updates to the class could cause compatibility issues that I'll have to deal with somehow.
c. Con: If I don't have the class in my path, the data won't load properly.
d. Con: Can't load only the metadata which could be really useful when there are lots of data sets to choose from.
2. Never save the imported data, always read it from the original provided data format.
a. Pro: No compatibility issues
b. Con: Slower processing, especially as the number of data sets grows
c. Con: What do I do when I have to clean up the data? Do I have to export it to that same format once I clean it up? Do I save it to a different format?
3. Save the data to a data storage format (e.g. HDF5 or CDF)
a. Pro: Decouples data storage format from custom class format allowing each to evolve or even be replaced based on need (i.e. more flexible).
b. Pro: Depending on the format, I can potentially take advantage of the compression and partial-load strategies these formats allow. This could be especially useful when I have a lot of data sets and need to use some sort of database system to find specific subsets of data.
c. Con: More code to develop and maintain (i.e. export/import to/from data storage format).
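As a rough illustration of what option 3 could involve, here is a minimal sketch using MATLAB's high-level HDF5 functions. The file name, the `/measurement` group, the dataset name, and the attribute names are all hypothetical, and the random matrix only stands in for a real data set. It is not a full export/import layer for the custom class.

```matlab
% Hypothetical file and dataset names, chosen only for illustration.
fname = fullfile(tempdir, 'example_dataset.h5');
if isfile(fname), delete(fname); end

data = rand(1000, 8);   % stand-in for one measured data set

% Create a chunked, compressed dataset, then write the data into it.
h5create(fname, '/measurement/values', size(data), ...
    'Datatype', 'double', 'ChunkSize', [100 8], 'Deflate', 5);
h5write(fname, '/measurement/values', data);

% Store metadata as attributes on the group that holds the dataset.
h5writeatt(fname, '/measurement', 'units', 'volts');
h5writeatt(fname, '/measurement', 'formatVersion', 1);

% Later: read only the metadata, without loading the data itself.
units = h5readatt(fname, '/measurement', 'units');

% Partial load: the first 10 rows of all 8 columns.
firstRows = h5read(fname, '/measurement/values', [1 1], [10 8]);
```

Reading attributes without touching the dataset is what would address the "can't load only the metadata" concern from option 1, and the `start`/`count` arguments to `h5read` are the partial-load mechanism mentioned above.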
I have three questions I'm hoping to get answers to:
1. Which storage solution should I go with? Please include your reasons since I'm not sure I've covered all the pros/cons.
2. Have I made any bad assumptions or conclusions about my potential choices?
3. Are there any other potential storage options I haven't considered?
Notes:
  • So far I haven't found anything online suggesting a particular or alternate path. I have come across some articles about data models and data classes, but there's a lot to digest and it may be overkill for the 10-20 people who are going to use this.
  • The data sets I'll be dealing with will be both small and very large. I sometimes have problems loading the large data sets into memory, but I've been able to get around this by using a high performance computing cluster I have access to. I'm not sure if this is a viable long term strategy though.
  • I've considered using tables/tall arrays/time series but I'm not sure the Metadata/CustomProperties can fully meet the need my custom class is supposed to fill. Maybe I'm wrong though.
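For reference, this is roughly what table metadata looks like when attached through `Properties` and `CustomProperties`. It is only a sketch: the variable names, the property names `Instrument` and `Calibrated`, and their values are made up for illustration, and it says nothing about whether this covers everything the custom class needs.

```matlab
% Small example table; the variable names are hypothetical.
T = table((1:5)', rand(5, 1), 'VariableNames', {'time', 'voltage'});

% Built-in metadata fields.
T.Properties.Description = 'Example measurement';
T.Properties.VariableUnits = {'s', 'V'};

% Custom metadata: one property for the whole table, one per variable.
T = addprop(T, {'Instrument', 'Calibrated'}, {'table', 'variable'});
T.Properties.CustomProperties.Instrument = 'hypothetical scope';
T.Properties.CustomProperties.Calibrated = [true false];
```

Per-variable custom properties need one element per table variable, which is why `Calibrated` is a two-element array here.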
  4 Comments
Derek on 28 Mar 2024 at 15:15
@Stephen23, valid points on the MAT files and the basic save approach. I did research your alternate suggestion (https://www.mathworks.com/matlabcentral/answers/636630-save-classdef-with-object-in-mat-file) and it does have merit. The class I'm creating, though, has quite a few subclasses and methods to handle different data types, so it might be more headache than it's worth. I think I'll keep it in mind in case someone doesn't have access to the classdef namespace.
@Chunru, I came across NetCDF when looking at the other general data formats but have not looked into it any more than the rest. All of them have lots of details to digest. Is there any particular reason you would suggest it over the others?
Chunru on 2 Apr 2024 at 10:06
NetCDF is a popular, general-purpose format. It is self-describing and well supported in many programming languages. In MATLAB there are high-level functions, so you do not have to go into too much detail.
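A minimal sketch of those high-level functions, assuming a single 2-D variable. The file name, variable name, dimension names, and attributes are hypothetical and only illustrate the calls.

```matlab
% Hypothetical file and variable names, chosen only for illustration.
fname = fullfile(tempdir, 'example_dataset.nc');
if isfile(fname), delete(fname); end

data = rand(1000, 8);   % stand-in for one measured data set

% Define the variable with named dimensions, then write the data.
nccreate(fname, 'values', ...
    'Dimensions', {'sample', 1000, 'channel', 8}, ...
    'Datatype', 'double', 'Format', 'netcdf4');
ncwrite(fname, 'values', data);

% Metadata: one attribute on the variable, one global attribute.
ncwriteatt(fname, 'values', 'units', 'volts');
ncwriteatt(fname, '/', 'description', 'example data set');

% Read back only the metadata, then only part of the data.
units = ncreadatt(fname, 'values', 'units');
firstRows = ncread(fname, 'values', [1 1], [10 8]);
```

`ncdisp(fname)` prints the file's structure and attributes, which is a quick way to see what "self-describing" means in practice.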


Answers (0)
