Loading MAT-files within parfor takes a long time
Hi,
I am using the code below to load and process 2627 MAT-files inside a parfor loop. Each file is about 100MB, for a total of 272GB. It is taking a very long time. I use a progress bar for parfor (not shown in this simplified code), and the progress bar has not even started yet, so I think MATLAB is still doing some overhead work. A plain for-loop does not seem to be any slower than parfor. Can someone please give me a suggestion? My computer has 40 cores and 383GB of RAM.
Merry Xmas,
clear all;
folder_mat_file = '.\Mat_files_full\';
mat_filename_valid = dir([folder_mat_file '*.mat']);
no_mat_files = numel(mat_filename_valid);
if isempty(gcp)
    parpool;
end
poolobj = gcp;
for n=1:no_mat_files
    filename_cell{n} = [folder_mat_file mat_filename_valid(n).name];
end
addAttachedFiles(poolobj,filename_cell)
parfor n=1:no_mat_files
    % data_struct is temporary variable
    data_struct = parload([folder_mat_file mat_filename_valid(n).name]);
    %% Here we do processing
end

%% Below is the function parload use to load file within parfor
function x = parload(filename)
    x = load(filename);
end
7 Comments
Walter Roberson
on 24 Dec 2018
There should be no need to addAttachedFiles all of the files. Provided that you pass in the full file name, you should be able to load() each file directly. You should not even need to put the load() inside a function.
thisdir = pwd();
folder_mat_file = fullfile(thisdir, 'Mat_files_full');
mat_filename_valid = dir( fullfile(folder_mat_file, '*.mat') );
no_mat_files = numel(mat_filename_valid);
% Build the full path of every file once, outside the loop
filename_cell = fullfile(folder_mat_file, {mat_filename_valid.name});
% 'nocreate' checks for an existing pool without starting one as a side effect
if isempty(gcp('nocreate'))
    parpool;
end
poolobj = gcp;
parfor n=1:no_mat_files
    % data_struct is temporary variable
    data_struct = load(filename_cell{n});
    %% Here we do processing
end
You will probably get some controller contention when loading the files, so do not expect a speedup proportional to the number of workers. I suspect that performance will get worse beyond about four cores, unless the "here we do processing" step takes longer than loading the files. Depending on the processing you are doing, you might find it worthwhile to open a parpool with only a fraction of the available cores, but configure the pool to give you several cores per worker; see e.g., https://www.mathworks.com/matlabcentral/answers/395333-using-parfor-on-cluster-changing-number-of-cores-per-worker#answer_315541
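A minimal sketch of the suggestion above, assuming the default cluster profile and an illustrative split of 10 workers with 4 threads each (both numbers are assumptions to tune, not measured recommendations). It also times a single load() so that the load time can be compared against the processing time before choosing a pool size.

```matlab
% Sketch only: worker and thread counts are illustrative assumptions.
folder_mat_file = fullfile(pwd(), 'Mat_files_full');
files = dir(fullfile(folder_mat_file, '*.mat'));

% Time one load to see how much of each iteration is file I/O.
if ~isempty(files)
    t = tic;
    data_struct = load(fullfile(folder_mat_file, files(1).name));
    fprintf('Loading one file took %.2f s\n', toc(t));
end

% Open a smaller pool whose workers each get several computational threads.
c = parcluster();                        % default cluster profile
c.NumThreads = 4;                        % threads per worker (assumed value)
num_workers = min(10, c.NumWorkers);     % do not exceed the profile's limit
pool = gcp('nocreate');                  % reuse an existing pool if there is one
if isempty(pool)
    pool = parpool(c, num_workers);
end
```

If loading dominates, a smaller pool is likely to do about as well as a large one; if processing dominates, more workers (or more threads per worker for multithreaded processing) may pay off.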
Walter Roberson
on 24 Dec 2018
You would want to call save from a separate function if you were using save without specifying which variables to save, which is not the recommended form in any case.
Dave Lee
on 24 Dec 2018
Walter Roberson
on 24 Dec 2018
Ah, they used to permit it, but it would give a transparency error if you did not name all of the variables.
Continue on with your parsave routine then.
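The parsave routine referred to here is not shown in the thread, so the following is only a hypothetical sketch of the usual pattern: save is called from a helper function with the variable named explicitly, which keeps the parfor body transparent. The output folder, file names, and the stand-in computation are all assumptions for illustration.

```matlab
% Hypothetical driver: squares stand in for the real processing step.
out_dir = tempdir;
parfor n = 1:4
    result = n.^2;
    parsave(fullfile(out_dir, sprintf('result_%d.mat', n)), result);
end

function parsave(filename, result)
% Save one explicitly named variable. Calling save here, rather than
% directly in the parfor body, avoids the transparency error.
    save(filename, 'result');
end
```

Each iteration writes its own file, so no two workers write to the same MAT-file.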
Answers (0)