Binary, ASCII and Compression Algorithms

Matlab2010 on 11 Jul 2014
Edited: José-Luis on 14 Jul 2014
I have a large number (> 1E6) of ASCII files (myFile.txt) which contain time series data, all in the same format: timestamp, field 1, field 2, ..., field 20. Each data entry is one row, tab-separated. Each of fields 2-20 is a double. The timestamp is a string (HH:MM:SS.FFF). The files are each c. 5 GB in size.
I wish to reduce the hard disk storage required. How can I do this?
My thoughts so far are
1. Convert the files to binary format. How can I do this? Is it by applying dec2bin.m? However, that function seems to accept only scalars. What would this look like?
2. Compress each file. Each file may be used independently of the others, so I wish to compress them individually. I know that different compression approaches perform differently on different data structures. Given my data structure above, which is the best one to apply?
Given the importance of this, I would be happy to call other languages from inside MATLAB (e.g. C++). Are there any standard libraries or third-party tools you can recommend?
3. Any other suggestions?
Finally, an important point is that I wish the user to be able to quickly load and access the data in each file - i.e. reading the binary data back must be quick, as must the decompression.
thank you!
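On point 1: since fields 2-20 are already doubles, one option is to write the raw bytes with fwrite rather than using dec2bin (which produces a text representation of bits, not a binary file). A minimal sketch, assuming every row has the same timestamp-plus-20-doubles layout and that timestamps are converted to serial date numbers first; file names are illustrative:

```matlab
% Read the tab-separated text file: one timestamp string plus 20 doubles per row.
fid = fopen('myFile.txt', 'r');
C = textscan(fid, ['%s' repmat('%f', 1, 20)], 'Delimiter', '\t');
fclose(fid);

% Convert timestamps to numeric serial date numbers, then stack everything
% into one [rows x 21] double matrix.
t = datenum(C{1}, 'HH:MM:SS.FFF');
data = [t, C{2:end}];

% Write raw IEEE doubles directly -- no dec2bin needed.
fid = fopen('myFile.bin', 'w');
fwrite(fid, data.', 'double');   % transpose so each row is stored contiguously
fclose(fid);

% Fast read-back: fread recovers the whole matrix in one call.
fid = fopen('myFile.bin', 'r');
data2 = fread(fid, [21, Inf], 'double').';
fclose(fid);
```

Reading with a single fread keeps loading quick, which addresses the access-speed requirement above.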
  3 Comments
Matlab2010 on 14 Jul 2014
Edited: Matlab2010 on 14 Jul 2014
I don't want to use a database due to I/O costs.
The binary files would contain no text, as I would convert the timestamps to a numeric format (e.g. using datenum.m).
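For reference, a round-trip along those lines (note the serial date number is a MATLAB double counting days since year 0, not a Java epoch time):

```matlab
% Convert an HH:MM:SS.FFF string to a numeric serial date number and back.
t = datenum('09:30:15.250', 'HH:MM:SS.FFF');   % a double, storable in binary
s = datestr(t, 'HH:MM:SS.FFF');                % back to '09:30:15.250'
```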
José-Luis on 14 Jul 2014
Edited: José-Luis on 14 Jul 2014
I would use a database. Which one is mostly down to personal preferences and constraints. I like mysql because it's free.
Depending on what your data looks like, you could use the NetCDF format. MATLAB has built-in support for reading and writing it. The same is true for HDF5. These are sort of lightweight databases, though.
IMO, I/O through a database would be faster than wading through the mountain of files you have, unless you plan on hard-coding file paths. I haven't tested it, though, so that's not definite.
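If HDF5 fits, here is a minimal MATLAB sketch (the file and dataset names are illustrative); the same file can then be read from Python (e.g. h5py) and R (e.g. rhdf5):

```matlab
% Write a [rows x 21] double matrix to an HDF5 file with gzip compression.
data = rand(1000, 21);   % placeholder for the real time series matrix
h5create('myFile.h5', '/timeseries', size(data), ...
         'Datatype', 'double', 'Deflate', 4, 'ChunkSize', [1000 21]);
h5write('myFile.h5', '/timeseries', data);

% Read it back -- the whole dataset, or a hyperslab for partial access.
back = h5read('myFile.h5', '/timeseries');
```

The 'Deflate' level trades compression ratio against read/write speed, and chunked storage lets you load a subset of rows without decompressing the whole file.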


Answers (1)

Star Strider on 11 Jul 2014
I would read them in as text files, save them as ‘.mat’ files (in the default binary format), then delete the text files. Since the ‘.mat’ files have a different suffix/extension, the prefix name can be the same as for the text file. See the documentation for save and load for details.
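A sketch of that workflow (file and variable names are illustrative; readtable assumes R2013b or later, otherwise textscan works):

```matlab
% Parse the tab-separated text file into a table.
T = readtable('myFile.txt', 'Delimiter', '\t', 'ReadVariableNames', false);

% Save it as a binary .mat file with the same base name, then drop the text file.
save('myFile.mat', 'T');
delete('myFile.txt');

% Later sessions reload the data without re-parsing any text.
S = load('myFile.mat');   % S.T holds the table again
```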
  2 Comments
Matlab2010 on 11 Jul 2014
1. I would like to be able to access the data from Python and R as well as matlab.
2. Does compressing .mat files help much? E.g. zip.m
Star Strider on 11 Jul 2014
Edited: Star Strider on 11 Jul 2014
  1. If you want to access the files from other applications, your best option would be to go with something other than .mat files, since to the best of my knowledge, those are MATLAB-specific. I’m not familiar with the file types Python and R can read and write, so you would need to find a common, space-efficient file format for all three applications.
  2. Compressing them would help. You will probably have to go that route anyway, considering the sizes of the files.
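One quick way to measure the payoff (zip and dir are standard MATLAB functions; file names are illustrative). Note that .mat files in the default v7 format are already zlib-compressed internally, so the additional gain from zipping them may be modest:

```matlab
% Zip a saved .mat file and compare the on-disk sizes.
zip('myFile.zip', 'myFile.mat');
before = dir('myFile.mat');
after  = dir('myFile.zip');
fprintf('Compressed to %.1f%% of original size\n', ...
        100 * after.bytes / before.bytes);
```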

