Quickly load and add big matrices

tony on 22 Apr 2016
Answered: Jan on 22 Apr 2016
Hi, I have a bunch of big matrices in txt format (about 7 GB each) and I want to add them up quickly. Loading them and then doing the addition in Matlab takes a long time. First, I use the following command to save each matrix from the Matlab workspace to a MAT-file:
save(filename,'bigdata','-v7.3')
Then I want to load each of these back into Matlab and add them up. Here's a rough estimate of the time needed:
~1 min to load each matrix
~2 s to add them up.
But I have so many of these big matrices that loading and adding them may take over a week, even using parallel sessions. Is there a more efficient way to do this job? Most of the time seems to be spent on loading. If this can't be solved in Matlab, is there anything outside Matlab that I can try? Any suggestion is greatly appreciated. Thanks!
P.S. I'm using Matlab 2012b on a Linux server.

Answers (2)

Alessandro Masullo on 22 Apr 2016
The problem with loading big data is not Matlab but the speed of your hard disk. The only way to load the data faster is to use a faster disk (such as an SSD).

Jan on 22 Apr 2016
Storing large numerical arrays in text format is a really bad idea. The text format is useful if a human needs to read and edit the data. But who can read 7 GB of numbers?!
So it is better to store them in binary format using fwrite. MAT-files in v7.3 format compress the data. This might be faster if the hard disk is slow and reading is the bottleneck, but decompression takes time as well. So on a local disk, or even an SSD, an uncompressed binary format will be more efficient.
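A minimal sketch of the binary round trip with fwrite and fread. The small random matrix stands in for the real data, and the file name and the two-value size header are assumptions made for illustration; a raw binary file carries no size or type information on its own, so the reader has to know the layout.

```matlab
% Small stand-in for the real matrix (the variable name bigdata is from the question).
bigdata = rand(1000, 500);

% Hypothetical file name for this sketch.
binfile = [tempname '.bin'];

% Write a tiny header (rows, columns) followed by the raw doubles.
% fwrite stores the elements in column-major order, MATLAB's native layout.
fid = fopen(binfile, 'w');
if fid == -1, error('Cannot open %s for writing.', binfile); end
fwrite(fid, size(bigdata), 'uint64');
fwrite(fid, bigdata, 'double');
fclose(fid);

% Read the header first, then the data in one call.
fid = fopen(binfile, 'r');
if fid == -1, error('Cannot open %s for reading.', binfile); end
dims = fread(fid, [1 2], 'uint64');      % returned as double by default
restored = fread(fid, dims, 'double');   % fills a dims(1)-by-dims(2) matrix
fclose(fid);

delete(binfile);
```

Converting each text file to this form once means the slow text parsing is paid a single time; whether it beats a v7.3 MAT-file on a given disk is something to time rather than assume.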
How much RAM do you have? Adding two 7 GB matrices requires 21 GB of free RAM: the two inputs plus the result. If the values end up in virtual memory, they are actually stored on the hard disk again.
If you really only want to add the arrays, you can process the data block-wise: read 8'000 elements from both arrays, add them, and write the result to the output (again, preferably in binary format). This reduces the RAM consumption, and the blocks even fit into the processor cache.
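The block-wise addition could look roughly like this. It is a sketch under stated assumptions: both inputs are headerless files of raw doubles with the same number of elements, and the file names and the small test vectors are made up so the example runs on its own.

```matlab
% Create two small test files standing in for the 7 GB matrices (hypothetical names).
n = 20000;                      % elements per file
A = rand(n, 1);
B = rand(n, 1);
fileA   = [tempname '_A.bin'];
fileB   = [tempname '_B.bin'];
fileOut = [tempname '_sum.bin'];
fid = fopen(fileA, 'w'); fwrite(fid, A, 'double'); fclose(fid);
fid = fopen(fileB, 'w'); fwrite(fid, B, 'double'); fclose(fid);

blockSize = 8000;               % elements per block, as suggested in the answer

fa = fopen(fileA, 'r');
fb = fopen(fileB, 'r');
fo = fopen(fileOut, 'w');
if fa == -1 || fb == -1 || fo == -1, error('Cannot open a file.'); end

while true
    a = fread(fa, blockSize, 'double');   % the last block may be shorter
    if isempty(a), break; end             % end of file A reached
    b = fread(fb, numel(a), 'double');
    if numel(b) ~= numel(a)
        error('Input files differ in length.');
    end
    fwrite(fo, a + b, 'double');          % only one block of each array is in RAM
end

fclose(fa); fclose(fb); fclose(fo);

% Read the result back to check it against the in-memory sum.
fid = fopen(fileOut, 'r');
C = fread(fid, Inf, 'double');
fclose(fid);

delete(fileA); delete(fileB); delete(fileOut);
```

The same loop extends to more than two inputs by opening all files and summing one block from each per iteration. The block size of 8'000 follows the answer; other sizes may be worth timing on the actual hardware.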
Parallel processing will not help, because RAM and hard disk access are the bottlenecks.
