Asked by David Santos
on 12 Aug 2019

I'm trying to calculate a percentile over a large number of files (25000 or more), each containing a 4x1 cell representing 4 maps, i.e. 1483x2824 matrices.

I'm using tall arrays, following the approach in Percentiles of Tall Matrix Along Different Dimensions:

tic

% start local pool for multithreading
c = parcluster('local');
c.NumWorkers = 20;
parpool(c, c.NumWorkers);

folder = '/home/temporal2/dsantos/mat/*.mat'; % more than 25000 files

A = ones(1483,2824,2); % aux matrix to establish the prctile data type
y = tall(A);

% datastore of files containing a 4x1 cell of 1483x2824 maps
ds = fileDatastore(folder,'ReadFcn',@loadPrc,'FileExtensions','.mat','UniformRead',true)
t = tall(ds);

% fill the aux tall array with each map in the correct format
for i = 1:25000
    y(:,:,i) = t(1+(i-1)*1483:1483*i,:);
end

% calculate the percentile
p90_1 = prctile(y,90,3)
P90_1 = gather(p90_1);
save('/home/temporal2/dsantos/p90_1.mat','P90_1','-v7.3');

toc

But it seems that tall arrays won't work for this because I get the error:

Warning: Error encountered during preview of tall array 'p90_1'. Attempting to
gather 'p90_1' will probably result in an error. The error encountered was:
Requested 500025x500025 (1862.8GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or preference panel for
more information.
> In tall/display (line 21)

p90_1 =

  MxNx... tall array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

>> Error using digraph/distances (line 72)
Internal problem while evaluating tall expression. The problem was:
Requested 500028x500028 (1862.9GB) array exceeds maximum array size preference.
Creation of arrays greater than this limit may take a long time and cause
MATLAB to become unresponsive. See array size limit or preference panel for
more information.

Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadata (line 756)
    allDistances = distances(cg.Graph);

Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadataFillingPartitionedArrays (line 739)
    [metadatas, partitionedArrays] = iGenerateMetadata(inputArrays, executorToConsider);

Error in ...

Error in tall/gather (line 50)
    [varargout{:}] = iGather(varargin{:});

Caused by:
    Error using matlab.internal.graph.MLDigraph/bfsAllShortestPaths
    Requested 500028x500028 (1862.9GB) array exceeds maximum array size
    preference. Creation of arrays greater than this limit may take a long time
    and cause MATLAB to become unresponsive. See array size limit or preference
    panel for more information.

Any clue on how to solve this problem?

All the best

Answer by Edric Ellis
on 13 Aug 2019

That particular error is an internal error, essentially because your tall array expression has become too large - it contains too many deferred operations. tall arrays operate by building up a symbolic representation of all the expressions you've evaluated, and then running them all together when you call gather. Because you've got a for loop over 25000 elements, this symbolic representation is huge - too large to be evaluated. tall arrays are simply not designed to be looped over in this way. Instead, you need to express your program in terms of a smaller number of vectorised operations.
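To illustrate the deferred-evaluation point (a small self-contained sketch, not the original data):

```matlab
t = tall(rand(1000, 4));  % small in-memory example wrapped as tall
s = sum(t, 1);            % nothing computed yet - just recorded in the graph
s2 = s + 1;               % graph grows by one more node
result = gather(s2);      % the whole graph is evaluated here, in one pass
% In the question's loop, each of the 25000 indexed assignments adds
% nodes to this deferred graph, which is what eventually overflows the
% internal bookkeeping when gather tries to evaluate it.
```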

I would proceed in the following manner (I can't be more specific since your problem statement isn't executable - see this page on tips regarding making a minimal reproduction):

- Have your loadPrc return a 4 × 1483 × 2824 numeric array (rather than a cell array)
- Your corresponding tall array t will then be 25000 × 1483 × 2824
- Instead of the for loop, simply call prctile in dimension 1

ds = fileDatastore(folder, 'ReadFcn', @loadPrc, 'FileExtensions', '.mat', 'UniformRead', true);

t = tall(ds);

p90_1 = prctile(t, 90, 1);

P90_1 = gather(p90_1);

% and then perhaps

P90_1 = shiftdim(P90_1, 1)
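For step 1, the loadPrc change might look something like this (a sketch - the field lookup via fieldnames is an assumption, since I don't know the variable name stored in your MAT-files):

```matlab
function dataOut = loadPrc(filename)
% Return a 4 x 1483 x 2824 numeric array so that 'UniformRead'
% can concatenate the files along dimension 1.
data = load(filename);                % struct with one field
fn = fieldnames(data);
maps = data.(fn{1});                  % the 4x1 cell of 1483x2824 maps
stacked = cat(3, maps{:});            % 1483 x 2824 x 4
dataOut = permute(stacked, [3 1 2]);  % 4 x 1483 x 2824
end
```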


Answer by David Santos
on 13 Aug 2019

Thanks a lot for your answer Edric!

I'm not sure how to solve point 1. Here's my simplified loadPrc:

function dataOut = loadPrc(filename)
data = load(filename);   % struct whose single field is a 4x1 cell: 4 frequency maps of 1483x2824 points
fn = fieldnames(data);
dataOut = data.(fn{1}){1}; % let's solve just the first frequency map for the moment... 1483x2824 matrix
end

How can I modify this to match your proposal?

I've tried this now on my server, but because it runs the 2017a version, "'UniformRead', true" is not working, so dataOut is always a cell. Can I get a numeric matrix somehow?

On the other hand, if I just calculate the percentile of one frequency map (as stated in loadPrc), dataOut is going to be a 2-D, not 3-D, matrix. I'm doing this because if I join the 4 frequencies, dataOut = 4x1483x2824. So how can I calculate each frequency's percentile? Maybe I can do:

p90_1=prctile(t(1:4:end,:,:),90,1);

P90_1=gather(p90_1);

p90_2=prctile(t(2:4:end,:,:),90,1);

P90_2=gather(p90_2);

?

All the best

Edric Ellis
on 14 Aug 2019

Ah, sorry, I hadn't realised that prctile in the tall dimension supports only vectors. Hm, this might turn out to be trickier than I thought. In fact, I'm not sure I know how to do this using tall arrays.

Let me just confirm that I got the basics of your problem correct - you do want to compute percentiles individually for each 1483x2824 element - so 4187992 percentiles down vectors of length 25000.

It may be that tall arrays aren't the right tool in this case - at the very least, I think it will be necessary to "transpose" the data so that you can load a handful of 25000-element vectors in memory at a time and call prctile on those in sequence (perhaps even in parallel if you have Parallel Computing Toolbox).
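A rough sketch of that "transposed" processing (assuming the maps have already been written into a MAT-file holding a 1483 x 2824 x 25000 array P, and with an illustrative block size - both are assumptions, not tested against the real data):

```matlab
% Read the data back in column blocks, so that only a handful of
% 25000-element vectors are in memory at any one time.
m = matfile('P.mat');                     % P is 1483 x 2824 x 25000 (hypothetical)
blockCols = 50;                           % columns per block; tune for memory
p90 = zeros(1483, 2824);
for c0 = 1:blockCols:2824
    cols = c0:min(c0 + blockCols - 1, 2824);
    block = m.P(:, cols, :);              % 1483 x numel(cols) x 25000 in memory
    p90(:, cols) = prctile(block, 90, 3); % per-pixel percentiles for the block
end
```

The outer loop over blocks could also be a parfor if the blocks are written to separate files, at the cost of more file I/O.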

David Santos
on 14 Aug 2019

Thanks for your answer!

- In the way I was working at the beginning of my question (with that tricky aux tall array to format the data as the 3-D tall array prctile likes), I was able to calculate prctile over the 3rd dim of the tall array of size 1483x2824x25000:

p90_1=prctile(t,90,3);

P90_1=gather(p90_1);

The problem was that at the end, when I used gather, MATLAB needed to load the entire array in memory, and it's always too big. I think tall arrays won't work because of this. It would be great to be able to load into memory only the p90_1 variable instead of the entire (400 GB) t matrix.

- Yes, you got it right: I want to compute percentiles individually for each 1483x2824 matrix/map. What you propose could be a solution, but even with parallel processing (40 cores) it would imply a lot of file loading, wouldn't it? I will try to do a mini-test and see what happens.

- What about other ways? mapreduce? Using big MAT-files on disk? Approximations to the percentile such as the P² algorithm?

David Santos
on 14 Aug 2019

SOME TESTING

I did some testing using just 4 maps/files (1483x2824 matrices) with your "slicing" percentile-calculation proposal. The first 2 options (using matfile and tall arrays) only calculate the first 10 rows.

%% Using matfile option
tic
folder = dir('matBorrame/*.mat'); % folder with 4 files
P = zeros(1483,2824,2);
save('P.mat','P','-v7.3');
m = matfile('P.mat','Writable',true);
for i = 1:4
    fprintf('%d\n',i);
    v = load(strcat('matBorrame/',folder(i).name));
    id = strcat('l',folder(i).name(1:end-4-7));
    m.P(:,:,i) = v.(id){1};
end
p90_1 = ones(1483,2824);
for r = 1:10 % only the first 10 rows
    fprintf('ROW:%d\n',r);
    for c = 1:2824
        p90_1(r,c) = prctile(m.P(r,c,:),90);
    end
end
save('p5_90_4.mat','p90_1','-v7.3');
toc
% Elapsed time is 190.559574 seconds.

%% Tall arrays option
tic
ds = fileDatastore('matBorrame','ReadFcn',@loadPrc,'FileExtensions','.mat','UniformRead',true)
t = tall(ds);
A = ones(1483,2824,2); % aux matrix to establish the prctile data format
y = tall(A);
for i = 1:4
    y(:,:,i) = t(1+(i-1)*1483:1483*i,:);
end
p90_1 = ones(1483,2824);
for r = 1:10 % only the first 10 rows
    fprintf('ROW:%d\n',r);
    for c = 1:2824
        aux = squeeze(y(r,c,:));
        p90_1(r,c) = gather(prctile(aux,90));
    end
end
save('p1_90_4.mat','p90_1','-v7.3');
toc
% Elapsed time is >5000 s. I stopped it before it finished...

%% In-memory option: processing all rows!
tic
p90_1 = prctile(m.P,90,3);
toc
% Elapsed time is 1.335489 seconds.

My conclusions:

- tall arrays are a bad solution for your proposal; they would take forever...

- Using matfile could work, but it is around 4000 times slower than the standard in-memory solution

All the best

