Memory problem reading data in large netcdf files for a single variable

I am reading a single variable from hourly netCDF files in a loop, and I need to calculate things like standard deviations, prctiles, max/min, etc. Each netCDF file is read with netcdf.getVar(ncid, varname), giving a variable of size 684x447x72 for each file and loop iteration. I have 3,250 files to read, so the full dimension would be 684x447x72x3250. Is there a way to somehow flatten "var" so I can calculate prctile, with "var" updated or appended after each loop iteration or file read rather than being constantly re-written? Thank you! I've been working on this for days now.
  2 Comments
per isakson
per isakson on 19 Aug 2017
Edited: per isakson on 19 Aug 2017
The title says "Memory problem".
>> 684*447*72*3250*8/1e9
ans =
5.723602560000001e+02
  • Assuming double, that's roughly 572 GB. Somehow flattening "var" doesn't change the size.
  • Is speed a problem?
  • "that "var" is updated or appended after each loop or file" — that works for std, min and max, but not for prctile.
  • Have you studied Large Files and Big Data? I'm not sure it would help in your case.
  • To help, I need more detail regarding your use case.
Randall
Randall on 19 Aug 2017
Hi - reading a file takes about 3-4 seconds to extract one variable, so that's fine. I'm reading the variable from the file all at once, so "var" is 684x447x72 at each iteration. If I use sum, max or min, that reduces the size/memory to 684x447, which is the goal with prctile and std also. I've used double with the sum and single where I can. My goal is to obtain a 684x447 array of prctile 10, 50 and 90 and also std. The common variable read from the netCDF files is wind speed. Thank you for your reply and I hope this helps. Randall


Answers (2)

Image Analyst
Image Analyst on 19 Aug 2017
Do you really need all that data in memory at one time? Maybe you can process one image at a time, or else read in one image and subsample it.
Or use memmapfile().
  1 Comment
Image Analyst
Image Analyst on 19 Aug 2017
We thought you had a problem even reading one single file into memory. Is that the case or not? Now, it sounds like it's not. If it's not then you don't need a loop over slices within an array. You can simply have one loop over files and re-use the same variable for each file (let's call it data).
for k = 1 : numFiles
    % Get the filename somehow.
    % Read in the data for this file:
    data = readnetcdf(); % Or however....
    % Compute statistics:
    means(k) = mean(data(:));
    maxs(k) = max(data(:));
    mins(k) = min(data(:));
    stDevs(k) = std(data(:));
    vars(k) = var(data(:));
end



per isakson
per isakson on 19 Aug 2017
Edited: per isakson on 21 Aug 2017
Am I correct that you want to calculate the statistical measures (min, max, etc.) for 684x447 time series of 3250 hours length (3250 hourly files)? I assume you do.
You propose
for file = all_files
    read file
    update intermediate values
    clear file
end
finally calculate min, max, etc. based on the intermediate values
That is fine for min, max and std. For std the intermediate values will include sums, sums of squares, etc.
"which is the goal with prctile [...] also" — I don't think that is possible. I think you need the entire time series to calculate a prctile.
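For std, the update-per-file approach works because it only needs running sums. A minimal sketch of those accumulators, assuming each file yields a 684x447x72 array via ncread (the names files and varname are placeholders, not from the question):

% Running sums per grid point; only 684x447 arrays are kept in memory.
% files and varname are hypothetical placeholders.
s1 = zeros( 684, 447 );          % sum of values
s2 = zeros( 684, 447 );          % sum of squared values
n  = 0;                          % number of time steps seen so far
for k = 1 : numel( files )
    v  = double( ncread( files{k}, varname ) );   % 684x447x72 per file
    s1 = s1 + sum( v, 3 );
    s2 = s2 + sum( v.^2, 3 );
    n  = n + size( v, 3 );
end
mu = s1 ./ n;                                 % per-gridpoint mean
sd = sqrt( (s2 - n .* mu.^2) ./ (n - 1) );    % per-gridpoint sample std

Note that subtracting two large sums like this can lose precision; accumulating in double (as above) mitigates that.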
An alternate approach is to read the files piecewise
loop over all slices of the 684x447x72 array
start =
count =
loop over all files
vardata = ncread( source, varname, start, count ); % overwrite vardata
calculate min,max,std,prctile
endloop
endloop
Execution time might be a problem. How the slices are chosen will affect the reading time. Matlab is column major and the netCDF C library is row major. One wants to read contiguous chunks from disk.
ADDENDUM 2017-08-20:
The script below is an exercise with ncread(...,start,count). It creates three identical small nc-files, each containing a 3D array named block. (I chose the default nc-format, which is netcdf4_classic.) Next it reads "slices" of that block and compares them with the Matlab array.
%% some data, which are easy to read and remember
isOk = false(0);
h = 6;  % Height
w = 3;  % Width
t = 12; % Time
vec = ( 1 : (h*w*t) );
block = reshape( vec, h, w, t );
%% create three identical files
delete( 'h:\m\cssm\randall*.nc' )
nccreate( 'h:\m\cssm\randall_1.nc', 'block' ...
    , 'Dimensions', {'Height',h,'Width',w,'Time',t} )
ncwrite( 'h:\m\cssm\randall_1.nc', 'block', block );
copyfile( 'h:\m\cssm\randall_1.nc', 'h:\m\cssm\randall_2.nc' );
copyfile( 'h:\m\cssm\randall_1.nc', 'h:\m\cssm\randall_3.nc' );
%% read the entire array
C1 = ncread( 'h:\m\cssm\randall_1.nc', 'block' );
isOk(end+1) = all(all(all( C1 == block )));
%% read slices and compare with the Matlab array
C1111 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [1,1,1], [1,1,t] );
C1111 = squeeze( C1111 );
isOk(end+1) = all( C1111 == squeeze( block(1,1,1:t) ) );
C2111 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [2,1,1], [1,1,t] );
C2111 = squeeze( C2111 );
isOk(end+1) = all( C2111 == squeeze( block(2,1,1:t) ) );
C1131 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [1,1,1], [3,1,t] );
C1131 = squeeze( C1131 );
isOk(end+1) = all( all( C1131 == squeeze( block(1:3,1,1:t) ) ) );
Change the filespec and run the script. Make some checks:
>> isOk
isOk =
1 1 1 1
>>
>> ncdisp( 'h:\m\cssm\randall_1.nc' )
Source:
           h:\m\cssm\randall_1.nc
Format:
           netcdf4_classic
Dimensions:
           Height = 6
           Width  = 3
           Time   = 12
Variables:
    block
           Size:       6x3x12
           Dimensions: Height,Width,Time
           Datatype:   double
>>
The function, read_nc_piesewise(), is an implementation of "an alternate approach is to read the files piecewise". It calculates the maximum for each point time series of the three nc-files created by the script above.
>> read_nc_piesewise
ans =
199 205 211
200 206 212
201 207 213
202 208 214
203 209 215
204 210 216
>>
These maximum values are identical to block(:,:,12) (as they should be):
>> block(:,:,12)
ans =
199 205 211
200 206 212
201 207 213
202 208 214
203 209 215
204 210 216
where
function out = read_nc_piesewise()
%
    folder = 'h:\m\cssm\';
    sad = dir( fullfile( folder, 'randall*.nc' ) );
    number_of_files = length( sad );
    %
    height_of_block = 6;
    width_of_block  = 3;
    length_of_time  = 12;
    %
    number_of_subblocks = 3;
    height_of_subblock  = height_of_block / number_of_subblocks;
    %
    out = nan( height_of_block, width_of_block );
    for jj = 1 : width_of_block
        for ii = 1 : number_of_subblocks
            start = [ 1+(ii-1)*height_of_subblock, jj, 1 ];
            count = [ height_of_subblock, 1, length_of_time ];
            subblock = nan( height_of_subblock, length_of_time*number_of_files );
            for kk = 1 : number_of_files
                ffs = fullfile( folder, sad(kk).name );
                num = ncread( ffs, 'block', start, count );
                num = squeeze( num );
                subblock( :, (1+(kk-1)*length_of_time : kk*length_of_time) ) = num;
            end
            out( (1+(ii-1)*height_of_subblock : ii*height_of_subblock), jj ) ...
                = max( subblock, [], 2 );
        end
    end
end
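The same piecewise structure should carry over to the percentiles: once subblock holds the full time series for a group of grid points, reduce along the second dimension with prctile instead of max. A hedged sketch of that inner reduction (the output arrays p10, p50, p90 are my additions, preallocated like out; rows follows the indexing already used in read_nc_piesewise):

rows = 1+(ii-1)*height_of_subblock : ii*height_of_subblock;
p = prctile( subblock, [10 50 90], 2 );  % one row per grid point, one column per percentile
p10( rows, jj ) = p(:,1);
p50( rows, jj ) = p(:,2);
p90( rows, jj ) = p(:,3);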
  3 Comments
per isakson
per isakson on 20 Aug 2017
Edited: per isakson on 20 Aug 2017
I note that you pose new questions, without answering my question:
"Am I correct that you want to calculate the statistical measures (min, max, etc.) for 684x447 time series of 3250 hours length (3250 hourly files)?"
I reformatted your comment so that the questions stand out.
per isakson
per isakson on 20 Aug 2017
Edited: per isakson on 21 Aug 2017
  • "I didn't understand that from my internet searches" Matlab has good documentation. Why not use that in the first place: see ncread, Read data from variable in NetCDF data source. The internet is full of obsolete and misleading information together with some good information, e.g. the Matlab documentation.
  • "Using ncread instead of netcdf.getVar" If a high-level function, e.g. ncread, does the job, don't bother with a low-level function, e.g. netcdf.getVar. ncread is somewhat slower because it opens and closes the file for every call. Opening an nc-file that stores many variables is a bit slow because (I guess) it builds some kind of index. And see Premature Optimization.
  • "is the "slice" loop the 72 data element" No
  • "Does start and count allow you to read in one-by-one the 684 and 447 data points?" Yes, but for speed it's better to read as much as memory "allows". However, the Windows system cache also needs memory to read the files, so there is a trade-off.
  • "The inner loop in your example would be #files." Yes

