How to create tall datastore from multiple data parts?

Alexandre Kaspar on 14 Nov 2016
Answered: Rick Amos on 29 Nov 2016
I would like to compute PCA on a large amount of data, so I want to use the tall array feature available in recent versions of MATLAB. My data consists of multiple blocks of features that I gather from big images, i.e. blocks Di of size (Ni, d).
Let's say I have M such blocks and I want to compute a single PCA over all of them together, i.e. something like
[coeff, score, latent] = pca([D1; D2; ... ; DM]);
but the concatenated array [D1; D2; ... ; DM] does not fit into memory (several GB of data). Each block Di fits in memory on its own, but their concatenation does not.
What is the best way to generate a datastore from these multiple blocks of data?
Note: I could instead build the covariance matrix myself and pass it to pcacov (eigendecomposition), since the covariance can be accumulated block by block as a sum of outer products, whatever the size of the full data matrix. However, I have read that pca is more numerically stable.
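For reference, the covariance-accumulation alternative described in the note can be sketched as follows. This is only a minimal illustration under stated assumptions: the random `blocks` cell array is a hypothetical stand-in for the real feature blocks, which would be loaded or computed one at a time, and `pcacov` requires the Statistics and Machine Learning Toolbox. The one-pass formula used here can lose precision when the column means are large relative to the spread of the data, which is one reason to be cautious about it.

```matlab
% Hypothetical stand-in data: three blocks with d = 4 features each.
rng(0);
d = 4;
numBlocks = 3;
blocks = cell(1, numBlocks);
for k = 1:numBlocks
    blocks{k} = randn(50 + 10*k, d);
end

% Running totals: row count, column sums, and sum of outer products.
n = 0;
s = zeros(1, d);
S = zeros(d);
for k = 1:numBlocks
    Dk = blocks{k};          % only one block is needed in memory at a time
    n = n + size(Dk, 1);
    s = s + sum(Dk, 1);
    S = S + Dk' * Dk;
end

% Sample covariance of the stacked data, without forming the stack.
mu = s / n;
C = (S - n * (mu' * mu)) / (n - 1);

% Principal components from the covariance matrix.
[coeff, latent] = pcacov(C);
```

On data this small, `C` can be compared against `cov(vertcat(blocks{:}))` as a sanity check.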

Rick Amos on 29 Nov 2016
A datastore can be created from a collection of folders, so the easiest way to achieve this is to write each block of data into its own folder using tall/write. The following code writes the blocks and then creates a datastore over all of the folders:
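% Root folder that will hold one subfolder per block.
baseFolder = fullfile(pwd, 'MyFolder');
for ii = 1 : numBlocks
    % calculateBlock stands for however block ii is produced.
    block = calculateBlock(ii);
    % Zero-padded names (00001, 00002, ...) keep the folders in order.
    subfolder = fullfile(baseFolder, num2str(ii, '%05i'));
    % Write the in-memory block to its own folder via tall/write.
    write(subfolder, tall(block));
end
% Build a single datastore spanning every block subfolder.
wildcardPattern = fullfile(baseFolder, '*');
ds = datastore(wildcardPattern);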
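To connect this back to the original question, the datastore can then be wrapped in a tall array and passed to pca. The following is a minimal end-to-end sketch, not a definitive implementation: the random blocks are a hypothetical stand-in for calculateBlock, the temporary folder is an arbitrary choice, and it assumes a release in which pca accepts tall arrays (Statistics and Machine Learning Toolbox). Whether coeff and latent come back as tall or in-memory arrays may depend on the release, so the sketch checks with istall before gathering.

```matlab
% Hypothetical stand-in for calculateBlock: three small random blocks.
rng(0);
numBlocks = 3;
d = 4;
baseFolder = tempname;   % scratch location, chosen only for this sketch

% Write each block to its own subfolder, as in the answer above.
for ii = 1 : numBlocks
    block = randn(100, d);
    subfolder = fullfile(baseFolder, num2str(ii, '%05i'));
    write(subfolder, tall(block));
end

% One datastore over all subfolders, then a tall array on top of it.
ds = datastore(fullfile(baseFolder, '*'));
X = tall(ds);

% PCA on the tall array; score is skipped here since it is as tall as X.
[coeff, ~, latent] = pca(X);

% Bring the small results into memory if they are still tall.
if istall(coeff)
    coeff = gather(coeff);
end
if istall(latent)
    latent = gather(latent);
end
```

The score output has one row per observation, so it is best left as a tall array and written back to disk or reduced further rather than gathered.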