imread - for vs. parfor - not seeing any gains
I am processing a lot of images for work and wanted to take advantage of the parallel architecture to save some time:
With a for loop, my script takes 11.0621 hrs to finish. With a parfor loop, my script takes 10.9742 hrs to finish.
The gain there is so minimal that it could just be fluctuation. The code I'm using calls imread and then regionprops about 800 times per region, and there are 8 regions for 13 samples. Trying to get at what the problem might be, I tried substituting a comparable image that was already in my workspace in lieu of the imread call and in that case the parfor loop went way faster than the for loop (2.692 hrs vs. 4.9351 hrs). This makes me think that my code is fine in terms of the amount of information being passed between workers, that I have a parallel pool, etc.
Anyone have any idea why the gains of using the parallel structure are so minimal with the imread call? Is it my hardware? All code was run on a 64-bit Windows 7 environment with 32 GB of RAM, a 3.4 GHz Intel i7 processor, and two NVIDIA GeForce GTX 570 graphics cards.
Here is a skeleton of my code:
Samples = {'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'H7', 'H8', 'H9', 'H10', 'H11', 'H12', 'H13'};
Regions = {'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8'};
try
    parpool('local', 4);
catch
end
AllSamplesTic = tic;
for S = Samples
    Sample = char(S);
    for R = Regions
        Region = char(R);
        cd([Sample '\' Region]);
        load('RegionInfo.mat');
        TotalImages = RegionInfo(:,5);
        for SliceNum = 1:size(TotalImages,1)
            % Gather some info
            parfor Ind = 1:TotalImages(SliceNum,1)
                I = imread(['pic' num2str(Ind) '.png']); % or I = dummyImage;
                props = regionprops(I, 'Area', 'Centroid', 'Eccentricity', 'Orientation');
                % Do some calculations here
                % Store results
            end
        end
    end
end
Thanks!
4 Comments
Thomas Ibbotson
on 13 Jun 2014
Other people have suggested that you might be saturating the bandwidth to disk, which is certainly possible. However, just in case, I'll ask the "are you sure it's plugged in?" question.
You have a "try/catch ignore" wrapped around your parpool call. Why did you do that? Was it throwing errors? If parpool threw an error and no pool opened, your code as written would not show that, and the parfor loop would silently run serially.
Thomas Ibbotson
on 13 Jun 2014
Ok, I know this has nothing to do with your original problem, but you can use 'gcp' to do that. See:
help gcp
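For example, a sketch of replacing the try/catch with gcp so pool-creation failures are surfaced (the pool size of 4 matches the skeleton above; adjust for your machine):

```matlab
% Get the current parallel pool without creating one; gcp('nocreate')
% returns [] if no pool is open.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool('local', 4);  % throws a visible error if this fails
end
fprintf('Pool is running with %d workers\n', pool.NumWorkers);
```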
STBLer
on 13 Jun 2014
Answers (2)
My guess would be that the parallel labs are all fighting each other for access to your hard drive and/or you have a slow hard drive. It might be worth a try to read the images in serially into a cell array (or into the slices of a 3D array if all are the same size) and then broadcast them to parfor.
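A minimal sketch of that idea for one slice, using the variable names from the skeleton above (the image count and file naming are assumptions carried over from that code):

```matlab
N = TotalImages(SliceNum,1);
Imgs = cell(N,1);
for Ind = 1:N                         % serial: only one reader hits the disk
    Imgs{Ind} = imread(['pic' num2str(Ind) '.png']);
end
parfor Ind = 1:N                      % parallel: pure computation, no disk I/O
    props = regionprops(Imgs{Ind}, 'Area', 'Centroid', 'Eccentricity', 'Orientation');
    % Do some calculations here and store results
end
```

The cell array Imgs is sliced across workers by parfor, so each worker only receives the images for its own iterations.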
10 Comments
Why would there be conflict accessing the hard drive? They are all trying to access a different image on the same hard drive.
But the cores still share a common channel to the hard drive, no? You don't have parallel buses to the hard drive that each core can use independently. Each parallel worker has to wait its turn for the bus, keeping itself as busy as possible in the meantime with computations, e.g., regionprops operations. Since you don't seem to have a large compute-to-memory ratio, your workers will be idle much of the time.
I don't expect to see huge gains here though as the computational part of it obviously isn't that intensive (only saved ~2.5 hours when using the dummy image)...
I'm not terribly hopeful either.
Image Analyst
on 13 Jun 2014
You're not going to be able to fit 83,200 images into a cell array - you'll run out of memory.
Matt J
on 13 Jun 2014
I meant read in the numSlices=800 images into an array. i.e., do the serial imread() just prior to the inner parfor loop.
But it's just intuition on my part that this might do any good - a speculation that a single-core serial read from the hard drive will be faster than a multi-core read.
STBLer
on 13 Jun 2014
Matt J
on 13 Jun 2014
I don't know how to reconcile that with the code you've posted. In that code, you perform numSlices imread's for every Region and Sample. So, "the number of images for the current Region and Sample" should be the same as numSlices=17.
Considering I gain 2 hrs when I use the dummy image, I am inclined to disagree that 17 slices is too small to be parallelizing...
The gain of 2 hours represents a factor of 2 speed-up. It's something, but with a parpool of 4 workers, you would hope for something closer to a factor of 4. Clearly the ratio of computation to communication is still not terribly favorable and the size of the parfor loop has a bearing on this.
Is there a way to track the idle time of a worker to see if your guess checks out?
Comment out all the processing steps inside the parfor loop apart from the imread and see how much things speed up. Incidentally, you could then repeat the same, but with 'parfor' replaced with plain 'for'. This would give us an idea how well we do reading from the hard drive in parallel vs. serially.
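A sketch of that read-only comparison (the image count N and file names follow the skeleton above and are assumptions; run both versions on the same data):

```matlab
N = 800;                               % assumed number of images to read
t = tic;
parfor Ind = 1:N
    I = imread(['pic' num2str(Ind) '.png']);   % read only, no processing
end
tPar = toc(t);

t = tic;
for Ind = 1:N
    I = imread(['pic' num2str(Ind) '.png']);   % same reads, serially
end
tSer = toc(t);
fprintf('parfor: %.1f s, for: %.1f s\n', tPar, tSer);
```

If the parfor time is not much better than the for time, the reads themselves are the bottleneck and parallelizing around them won't help.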
Rather than waiting 5-10 hours, I'd of course recommend you do these comparisons with the outer loop restricted to a smaller number of Regions and Samples.
Image Analyst
on 12 Jun 2014
Your hardware is pretty impressive. Doesn't seem like it should take 11 hours. What is the value of numSlices? The badly-named I is a binary image, right, not a gray scale image? Are there tons of regions in it? (That's the normal definition of regions, not your custom definition.) Perhaps some noise reduction would speed up the regionprops() if you're spending a lot of time measuring useless little bits of noise.
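If small noise blobs are inflating the regionprops work, one common approach is to remove tiny connected components first (a sketch; the 50-pixel threshold and file name are arbitrary assumptions to tune for your data):

```matlab
BW = imread('pic1.png');               % assumed to be a binary image
BW = bwareaopen(BW, 50);               % drop connected components < 50 pixels
props = regionprops(BW, 'Area', 'Centroid', 'Eccentricity', 'Orientation');
```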
3 Comments
STBLer
on 12 Jun 2014
Image Analyst
on 13 Jun 2014
You didn't answer the question about what numSlices is. Is it 800 slices? So you have 8*13*800 = 83,200 images to do regionprops() on?