imread - for vs. parfor - not seeing any gains

I am processing a lot of images for work and wanted to take advantage of the parallel architecture to save some time:
With a for loop, my script takes 11.0621 hrs to finish. With a parfor loop, my script takes 10.9742 hrs to finish.
The gain there is so minimal that it could just be fluctuation. The code I'm using calls imread and then regionprops about 800 times per region, and there are 8 regions for 13 samples. Trying to get at what the problem might be, I tried substituting a comparable image that was already in my workspace in lieu of the imread call and in that case the parfor loop went way faster than the for loop (2.692 hrs vs. 4.9351 hrs). This makes me think that my code is fine in terms of the amount of information being passed between workers, that I have a parallel pool, etc.
Anyone have any idea why the gains of using the parallel structure are so minimal with the imread call? Is it my hardware? All code was run on a 64-bit Windows 7 environment with 32 GB of RAM, a 3.4 GHz Intel i7 processor, and two NVIDIA GeForce GTX 570 graphics cards.
Here is a skeleton of my code:
Samples = {'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'H7', 'H8', 'H9', 'H10', 'H11', 'H12', 'H13'};
Regions = {'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8'};
try
    parpool('local', 4);
catch
end
AllSamplesTic = tic;
for S = Samples
    Sample = char(S);
    for R = Regions
        Region = char(R);
        cd([Sample '\' Region]);
        load('RegionInfo.mat');
        TotalImages = RegionInfo(:,5);
        for SliceNum = 1:size(TotalImages,1)
            % Gather some info
            parfor Ind = 1:TotalImages(SliceNum,1)
                I = imread(['pic' num2str(Ind) '.png']); % or I = dummyImage;
                props = regionprops(I, 'Area', 'Centroid', 'Eccentricity', 'Orientation');
                % Do some calculations here
                % Store results
            end
        end
    end
end
Thanks!

4 Comments

Other people have made suggestions that you might be saturating the bandwidth to disk, which is certainly possible. However, just in case, I'll ask the "are you sure it's plugged in?" question.
You have a "try/catch ignore" wrapped around your parpool. Why did you do that? Was it throwing errors? Perhaps your parpool threw an error and never opened; as written, your code would not show that if it happened.
STBLer on 13 Jun 2014
Edited: STBLer on 13 Jun 2014
The try catch was just there in case I had already started my parallel pool. The parallel pool indicator in the bottom left of the command window is green and showing that the workers are active.
Ok, I know this has nothing to do with your original problem, but you can use 'gcp' to do that. See:
help gcp
Yeah. I guess that's safer than the try/catch setup.
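(For readers following along: a sketch of the gcp-based idiom suggested above. gcp('nocreate') returns the current pool, or empty if none exists, without ever starting one, so parpool is only called when needed.)

    % Start a pool only if one is not already running.
    % gcp('nocreate') returns [] when no pool exists (it never creates one).
    if isempty(gcp('nocreate'))
        parpool('local', 4);
    end

This avoids silently swallowing a genuine parpool failure the way a bare try/catch does.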


Answers (2)

Matt J on 12 Jun 2014
Edited: Matt J on 13 Jun 2014
My guess would be that the parallel labs are all fighting each other for access to your hard drive and/or you have a slow hard drive. It might be worth a try to read the images in serially into a cell array (or into the slices of a 3D array if all are the same size) and then broadcast them to parfor.
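A sketch of that idea, assuming the images for the current slice are all 480-by-480 uint8 (the variable names follow the question's skeleton, but the structure here is illustrative, not the asker's actual code):

    % Serial read: one process streams images off the disk, one at a time.
    numImages = TotalImages(SliceNum, 1);
    stack = zeros(480, 480, numImages, 'uint8');   % preallocate if sizes match
    for Ind = 1:numImages
        stack(:, :, Ind) = imread(['pic' num2str(Ind) '.png']);
    end
    % Parallel compute: stack(:,:,Ind) is a sliced variable, so each worker
    % receives only the slices it needs.
    parfor Ind = 1:numImages
        props = regionprops(stack(:, :, Ind), 'Area', 'Centroid', ...
            'Eccentricity', 'Orientation');
        % Do some calculations here
        % Store results
    end

The point of the split is that the disk sees one sequential reader instead of four competing ones, while the CPU-bound regionprops work still runs on all workers.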

10 Comments

STBLer on 13 Jun 2014
Edited: STBLer on 13 Jun 2014
Why would there be conflict accessing the hard drive? They are all trying to access a different image on the same hard drive.
MATLAB is being run off a 128 GB solid-state drive and I am using a 64 GB solid-state drive as my cache drive. The images are stored on a conventional spinning-platter Seagate hard drive (ST3000DM001) at 7200 RPM with an advertised 6 Gb/s interface speed.
I will try your suggestion of reading the images into a cell array and then broadcasting them to the parallel workers. I don't expect to see huge gains here though as the computational part of it obviously isn't that intensive (only saved ~2.5 hours when using the dummy image)...
Matt J on 13 Jun 2014
Edited: Matt J on 13 Jun 2014
Why would there be conflict accessing the hard drive? They are all trying to access a different image on the same hard drive.
But the cores still share a common channel to the hard drive, no? You don't have parallel buses to the hard drive that each core can use independently. Each parallel worker has to wait its turn for the bus, keeping itself as busy as possible in the meantime with computations, e.g., regionprops operations. Since you don't seem to have a large compute-to-memory ratio, your workers will be idle much of the time.
I don't expect to see huge gains here though as the computational part of it obviously isn't that intensive (only saved ~2.5 hours when using the dummy image)...
I'm not terribly hopeful either.
You're not going to be able to fit 83,200 images into a cell array - you'll run out of memory.
I meant read in the numSlices=800 images into an array. i.e., do the serial imread() just prior to the inner parfor loop.
But... it's just intuition on my part that this might do any good. A speculation that a single core serial read from the hard drive will be faster than a multi-core read.
STBLer on 13 Jun 2014
Edited: STBLer on 13 Jun 2014
Each image is a 480 x 480 uint8 array. I don't think I'd have enough memory to fit 15,062 images into a cell array and then do anything useful, if that would fit into memory at all (15,062 x 480 x 480 = 3,470,284,800 bytes, so roughly 3.5 GB for the raw pixels alone).
Matt J on 13 Jun 2014
Edited: Matt J on 13 Jun 2014
I re-iterate. I only meant that you would read in numSlices images serially for the current Region and Samples. numSlices=17 seems like a very small loop to be parallelizing, however. I wager that's why you're seeing little benefit.
That number (15k) would be the number of images for the current Region and Sample. Considering I gain 2 hrs when I use the dummy image I am inclined to disagree that 17 slices is too small to be parallelizing...
I don't know how to reconcile that with the code you've posted. In that code, you perform numSlices imread's for every Region and Sample. So, "the number of images for the current Region and Sample" should be the same as numSlices=17.
STBLer on 14 Jun 2014
Edited: STBLer on 14 Jun 2014
Ah - that was my fault for oversimplifying the code when I posted it, and for the frustrated tone of my previous response. That parfor runs over 1:numImages, and each slice has its own specific number of images; on average you end up parallelizing over 886 images with this code on my machine. I updated the skeleton of my code to reflect this. Sorry about that.
Also, I missed your response regarding the hard drive access: Is there a way to track the idle time of a worker to see if your guess checks out? From my understanding, the run and time function would only tell me what my local worker is doing and nothing about the parallel workers.
I guess I naively thought that you would be capable of writing/reading the hard drive simultaneously on each of the 4 CPUs at least from my understanding of how addressing memory works on a CPU.
Matt J on 14 Jun 2014
Edited: Matt J on 14 Jun 2014
Considering I gain 2 hrs when I use the dummy image I am inclined to disagree that 17 slices is too small to be parallelizing...
The gain of 2 hours represents a factor of 2 speed-up. It's something, but with a parpool of 4 workers, you would hope for something closer to a factor of 4. Clearly the ratio of computation to communication is still not terribly favorable and the size of the parfor loop has a bearing on this.
Is there a way to track the idle time of a worker to see if your guess checks out?
Comment out all the processing steps inside the parfor loop apart from the imread and see how much things speed up. Incidentally, you could then repeat the same, but with 'parfor' replaced with plain 'for'. This would give us an idea how well we do reading from the hard drive in parallel vs. serially.
Rather than waiting 5-10 hours, I'd of course recommend you do these comparisons with the outer loop restricted to a smaller number of Regions and Samples.
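A minimal version of that read-only experiment might look like the following (the filename pattern and count are placeholders matching the question's skeleton):

    % Hypothetical benchmark: time a loop that does nothing but imread.
    % Run once with 'for' and once with 'parfor' and compare wall times.
    N = 800;                       % images in one representative slice
    t = tic;
    for Ind = 1:N                  % swap 'for' <-> 'parfor' between runs
        I = imread(['pic' num2str(Ind) '.png']); %#ok<NASGU>
    end
    fprintf('Read-only loop took %.1f s\n', toc(t));

If the parfor version is no faster than the for version here, the workload is disk-bound and more workers cannot help.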


Your hardware is pretty impressive. Doesn't seem like it should take 11 hours. What is the value of numSlices? The badly-named I is a binary image, right, not a grayscale image? Are there tons of regions in it? (That's the normal definition of regions, not your custom definition.) Perhaps some noise reduction would speed up regionprops() if you're spending a lot of time measuring useless little bits of noise.

3 Comments

I is not read in as a binary image. I obviously binarize it before running regionprops. There are however quite a few regions. I could potentially despeckle the image before running regionprops. I hadn't bothered thinking about optimizing yet because I was trying to figure out why the parfor was not running much faster than the for loop.
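(A sketch of the despeckling idea, in case it helps later readers. The binarization method and the 5-pixel threshold are illustrative assumptions, not taken from the asker's script.)

    BW = im2bw(I, graythresh(I));  % or whatever thresholding the script uses
    BW = bwareaopen(BW, 5);        % drop connected components under 5 pixels
    props = regionprops(BW, 'Area', 'Centroid', 'Eccentricity', 'Orientation');

Removing tiny noise components before regionprops cuts the number of regions it has to measure, which is where the suggested speedup would come from.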
You didn't answer the question about what numSlices is. Is it 800 slices? So you have 8*13*800 = 83,200 images to do regionprops() on?
STBLer on 13 Jun 2014
Edited: STBLer on 14 Jun 2014
Sorry about that. numSlices varies from sample to sample, but on average it is 16.8163 so call that 17. I oversimplified my skeleton and forgot to include a critical for loop. The code has been updated to reflect this. The number of images processed per slice also varies, but on average is 886, so ballpark there are 15,062 images that need regionprops performed on them for each region, bringing it to a total of 1,459,631 images that get processed over the 11 hrs.
Since you have a sufficiently high reputation, it might be helpful to clean this question up a bit. Also, what do you think of the suggestion that the parallel workers are spending a substantial amount of time being idle because of conflicts accessing the hard drive?


Asked: on 12 Jun 2014
Edited: on 14 Jun 2014
