Memory problem reading data in large netcdf files for a single variable

I am reading a single variable from hourly netCDF files in a loop, and I need to calculate things like standard deviations, prctiles, max/min, etc. Each netCDF file is read with netcdf.getVar(ncid, varname), giving a variable of size 684x447x72 for each file and loop iteration. I have 3,250 files to read, so the full dimension would be 684x447x72x3250. Is there a way to somehow flatten "var" so I can calculate prctile, with "var" updated or appended after each loop iteration or file read rather than being constantly re-written? Thank you! I've been working on this for days now.
  2 Comments
per isakson
per isakson on 19 Aug 2017
Edited: per isakson on 19 Aug 2017
The title says "Memory problem".
>> 684*447*72*3250*8/1e9
ans =
5.723602560000001e+02
  • Assuming double, that's roughly 572 GB. Somehow flattening "var" doesn't change the size.
  • Is speed a problem?
  • "that "var" is updated or appended after each loop or file" — that works for std, min and max, but not for prctile.
  • Have you studied Large Files and Big Data? I'm not sure it would help in your case.
  • To help, I need more detail regarding your use case.
Randall
Randall on 19 Aug 2017
Hi - reading a file takes about 3-4 seconds to extract one variable, so that's fine. I'm reading the variable from the file all at once, so "var" is 684x447x72 at each iteration. If I use sum, max or min, that reduces the size/memory to 684x447, which is the goal with prctile and std also. I've used double with the sum and single where I can. My goal is to obtain a 684x447 array of prctile 10, 50 and 90 and also std. The common variable read from the netCDF files is wind speed. Thank you for your reply and I hope this helps. Randall


Answers (2)

Image Analyst
Image Analyst on 19 Aug 2017
Do you really need all that data in memory at one time? Maybe you can process one image at a time, or else read in one image and subsample it.
Or use memmapfile().
  1 Comment
Image Analyst
Image Analyst on 19 Aug 2017
We thought you had a problem even reading one single file into memory. Is that the case or not? Now, it sounds like it's not. If it's not then you don't need a loop over slices within an array. You can simply have one loop over files and re-use the same variable for each file (let's call it data).
for k = 1 : numFiles
    % Get the filename somehow.
    % Read in the data for this file:
    data = readnetcdf(); % Or however....
    % Compute statistics:
    means(k) = mean(data(:));
    maxs(k) = max(data(:));
    mins(k) = min(data(:));
    stDevs(k) = std(data(:));
    vars(k) = var(data(:));
end



per isakson
per isakson on 19 Aug 2017
Edited: per isakson on 21 Aug 2017
Am I correct that you want to calculate the statistical measures (min, max, etc.) for 684x447 time series of 3250 hours length (3250 hourly files)? I assume you do.
You propose
for file = all_files
    read file
    update intermediate values
    clear file
end
finally calculate min, max, etc. based on the intermediate values
That is fine for min, max and std. For std the intermediate values will include sums, sums of squares, etc.
"which is the goal with prctile [...] also" — I don't think that is possible. I think you need the entire time series to calculate a prctile.
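For std, the update-per-file approach works because it only needs running sums. A minimal sketch of those accumulators, assuming each file yields a 684x447x72 array via ncread (the names files and varname are placeholders, not from the question):

% Running sums per grid point; only 684x447 arrays are kept in memory.
% files and varname are hypothetical placeholders.
s1 = zeros( 684, 447 );          % sum of values
s2 = zeros( 684, 447 );          % sum of squared values
n  = 0;                          % number of time steps seen so far
for k = 1 : numel( files )
    v  = double( ncread( files{k}, varname ) );   % 684x447x72 per file
    s1 = s1 + sum( v, 3 );
    s2 = s2 + sum( v.^2, 3 );
    n  = n + size( v, 3 );
end
mu = s1 ./ n;                                 % per-gridpoint mean
sd = sqrt( (s2 - n .* mu.^2) ./ (n - 1) );    % per-gridpoint sample std

Note that subtracting two large sums like this can lose precision; accumulating in double (as above) mitigates that.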
An alternate approach is to read the files piecewise
loop over all slices of the 684x447x72 array
start =
count =
loop over all files
vardata = ncread( source, varname, start, count ); % overwrite vardata
calculate min,max,std,prctile
endloop
endloop
Execution time might be a problem. How the slices are chosen will affect the reading time. Matlab is column major and the netCDF C library is row major. One wants to read contiguous chunks from disk.
ADDENDUM 2017-08-20:
The script below is an exercise with ncread(...,start,count). It creates three identical small nc-files, each containing a 3D array named block. (I chose the default nc-format, which is netcdf4_classic.) Next it reads "slices" of that block and compares them with the Matlab array.
%% some data, which are easy to read and remember
isOk = false(0);
h = 6;  % Height
w = 3;  % Width
t = 12; % Time
vec = ( 1 : (h*w*t) );
block = reshape( vec, h, w, t );
%% create three identical files
delete( 'h:\m\cssm\randall*.nc' )
nccreate( 'h:\m\cssm\randall_1.nc', 'block' ...
    , 'Dimensions', {'Height',h,'Width',w,'Time',t} )
ncwrite( 'h:\m\cssm\randall_1.nc', 'block', block );
copyfile( 'h:\m\cssm\randall_1.nc', 'h:\m\cssm\randall_2.nc' );
copyfile( 'h:\m\cssm\randall_1.nc', 'h:\m\cssm\randall_3.nc' );
%% read the entire array
C1 = ncread( 'h:\m\cssm\randall_1.nc', 'block' );
isOk(end+1) = all(all(all( C1 == block )));
%% read slices and compare with the Matlab array
C1111 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [1,1,1], [1,1,t] );
C1111 = squeeze( C1111 );
isOk(end+1) = all( C1111 == squeeze( block(1,1,1:t) ) );
C2111 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [2,1,1], [1,1,t] );
C2111 = squeeze( C2111 );
isOk(end+1) = all( C2111 == squeeze( block(2,1,1:t) ) );
C1131 = ncread( 'h:\m\cssm\randall_1.nc', 'block', [1,1,1], [3,1,t] );
C1131 = squeeze( C1131 );
isOk(end+1) = all( all( C1131 == squeeze( block(1:3,1,1:t) ) ) );
Change the filespec and run the script. Make some checks:
>> isOk
isOk =
1 1 1 1
>>
>> ncdisp( 'h:\m\cssm\randall_1.nc' )
Source:
           h:\m\cssm\randall_1.nc
Format:
           netcdf4_classic
Dimensions:
           Height = 6
           Width  = 3
           Time   = 12
Variables:
    block
           Size:       6x3x12
           Dimensions: Height,Width,Time
           Datatype:   double
>>
The function, read_nc_piesewise(), is an implementation of "an alternate approach is to read the files piecewise". It calculates the maximum for each point time series of the three nc-files created by the script above.
>> read_nc_piesewise
ans =
199 205 211
200 206 212
201 207 213
202 208 214
203 209 215
204 210 216
>>
These maximum values are identical to block(:,:,12) (as they should be):
>> block(:,:,12)
ans =
199 205 211
200 206 212
201 207 213
202 208 214
203 209 215
204 210 216
where
function out = read_nc_piesewise()
%
    folder = 'h:\m\cssm\';
    sad = dir( fullfile( folder, 'randall*.nc' ) );
    number_of_files = length( sad );
    %
    height_of_block = 6;
    width_of_block  = 3;
    length_of_time  = 12;
    %
    number_of_subblocks = 3;
    height_of_subblock  = height_of_block / number_of_subblocks;
    %
    out = nan( height_of_block, width_of_block );
    for jj = 1 : width_of_block
        for ii = 1 : number_of_subblocks
            start = [ 1+(ii-1)*height_of_subblock, jj, 1 ];
            count = [ height_of_subblock, 1, length_of_time ];
            subblock = nan( height_of_subblock, length_of_time*number_of_files );
            for kk = 1 : number_of_files
                ffs = fullfile( folder, sad(kk).name );
                num = ncread( ffs, 'block', start, count );
                num = squeeze( num );
                subblock( :, (1+(kk-1)*length_of_time : kk*length_of_time) ) = num;
            end
            out( (1+(ii-1)*height_of_subblock : ii*height_of_subblock), jj ) ...
                = max( subblock, [], 2 );
        end
    end
end
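The same piecewise structure should carry over to the percentiles: once subblock holds the full time series for a group of grid points, reduce along the second dimension with prctile instead of max. A hedged sketch of that inner reduction (the output arrays p10, p50, p90 are my additions, preallocated like out; rows follows the indexing already used in read_nc_piesewise):

rows = 1+(ii-1)*height_of_subblock : ii*height_of_subblock;
p = prctile( subblock, [10 50 90], 2 );  % one row per grid point, one column per percentile
p10( rows, jj ) = p(:,1);
p50( rows, jj ) = p(:,2);
p90( rows, jj ) = p(:,3);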
  3 Comments
per isakson
per isakson on 20 Aug 2017
Edited: per isakson on 20 Aug 2017
I note that you pose new questions, without answering my question:
"Am I correct that you want to calculate the statistical measures (min, max, etc.) for 684x447 time series of 3250 hours length (3250 hourly files)?"
I reformatted your comment so that the questions stand out.
per isakson
per isakson on 20 Aug 2017
Edited: per isakson on 21 Aug 2017
  • "I didn't understand that from my internet searches" Matlab has good documentation. Why not use that in the first place: see ncread, Read data from variable in NetCDF data source. The internet is full of obsolete and misleading information together with some good information, e.g. the Matlab documentation.
  • "Using ncread instead of netcdf.getVar" If a high-level function, e.g. ncread, does the job, don't bother with a low-level function, e.g. netcdf.getVar. ncread is somewhat slower because it opens and closes the file for every call. Opening an nc-file that stores many variables is a bit slow because (I guess) it builds some kind of index. And see Premature Optimization.
  • "is the "slice" loop the 72 data element" No
  • "Does start and count allow you to read in one-by-one the 684 and 447 data points?" Yes, but for speed it's better to read as much as memory "allows". However, the Windows system cache also needs memory to read the files, so there is a trade-off.
  • "The inner loop in your example would be #files." Yes

