How to speed up Bag of Visual Words calculation?

I am building an image retrieval system where I give an input image and the output is the most similar images from the dataset. I have created a bag of visual words from my dataset with 2000 words. I followed this process:
1. Extracted SIFT features from the images in the dataset and performed k-means clustering with k = 2000. Then I stored the centroids in a .mat file (a sketch of this offline step appears at the end of this question).
2. For an input image, I am creating its BoVW representation as follows:
indices_f = [];
for i = 1:size(features,1)              % this loop is taking too much time
    for j = 1:size(centroids,1)
        d(j) = pdist2(features(i,:), centroids(j,:)); % distance to centroid j
    end
    [~, indices_f(i)] = min(d);         % index of the nearest centroid
end
% Compute the histogram of visual words
feature_hist_local = zeros(1, 2000);
temp = indices_f;
for m = 1:length(temp)
    feature_hist_local(temp(m)) = feature_hist_local(temp(m)) + 1;
end
For my specific use case, the features matrix contains 3127 keypoints of dimension 128 each. The code converts it into a BoVW representation of size 1x2000, but it takes approximately 40 minutes to compute on a machine with 32 GB RAM and 6 cores.
Is there a better way to do this? Also, are there alternate approaches that improve upon bag of visual words, since I plan to work with large datasets now?
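For reference, the offline vocabulary step (step 1 above) looks roughly like this minimal sketch; allFeatures stands for the pooled N-by-128 matrix of SIFT descriptors from the whole dataset and is an illustrative name, not a variable from my actual code:
k = 2000;                                 % vocabulary size
[~, centroids] = kmeans(allFeatures, k);  % cluster the pooled SIFT descriptors
save('centroids.mat', 'centroids');       % store the vocabulary for retrieval time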
  3 Comments
sid on 21 Feb 2015
Thanks, dist[] was unnecessary. But does not pre-allocating d become such a high bottleneck?
dpb on 21 Feb 2015
"... does not pre-allocating d become such a high bottleneck..."
NO. It's quite quick and efficient to allocate a fixed array; the bottleneck is the need to reallocate it on every pass through the loop. That entails saving the present contents to a temporary, reallocating the memory for the target, and then copying the contents to the new location. Since you've got counted loops and you know the size of the return vector, you can allocate the required space a priori (see the sketch below).
You can also experiment with the optional parameters of pdist2 and see whether computing fewer distances is quicker, or whether the internal overhead costs more than you save. I'd guess it depends on the size: if the arrays are long, maybe; if not, probably not.
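A minimal sketch of the preallocation point (the size here is illustrative): both loops do the same work, but the first grows d on every pass while the second writes into space allocated once up front.
n = 1e5;

% Version 1: d grows on every pass, forcing repeated reallocation and copying.
clear d
tic
for j = 1:n
    d(j) = j^2;
end
tGrow = toc;

% Version 2: d is allocated once, so each pass is a simple in-place write.
d = zeros(1, n);
tic
for j = 1:n
    d(j) = j^2;
end
tPrealloc = toc;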


Accepted Answer

dpb on 21 Feb 2015
Looks like I glanced over the code too quickly and missed the key point, perhaps...
for i = 1:size(features,1)
    for j = 1:size(centroids,1)
        d(j) = pdist2(features(i,:), centroids(j,:));
        ...
In the above you're computing the full set of comparative distances one element at a time, but for each i you keep only the index of the minimum; all the distances themselves are thrown away. Since you're really just looking for the minimum across each row, you get the same result at far better performance with
d = pdist2(features, centroids);
and since you're looking later on for the minimum, the previous comment regarding the optional parameters is useful:
% Note the argument order: with 'Smallest', pdist2 finds, for each
% observation in the second input, the nearest observation in the first,
% so the centroids go first to get one nearest centroid per feature.
[d, idxMin] = pdist2(centroids, features, 'euclidean', 'Smallest', 1);
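Putting it together, the whole per-image BoVW computation might collapse to a sketch like this, assuming features is 3127-by-128 and centroids is 2000-by-128 as in the question; accumarray stands in for the explicit histogram loop:
% Nearest centroid for every keypoint in one call (idxMin is 1-by-3127).
[~, idxMin] = pdist2(centroids, features, 'euclidean', 'Smallest', 1);

% Histogram of visual words without an explicit loop: count how many
% keypoints were assigned to each of the 2000 centroids.
feature_hist_local = accumarray(idxMin(:), 1, [size(centroids,1) 1]).';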
  2 Comments
sid on 24 Feb 2015
Thanks. But I don't understand what the internal implementation does that makes it so fast.
dpb on 24 Feb 2015
The computational guts are a compiled mex file, and your loop is very inefficient in comparison, particularly the first implementation with its dynamically allocated results vector. You're also, I think, duplicating results that are symmetric (although I didn't go back to verify this in the original; it's a gut-reaction opinion from thinking about it superficially...).


More Answers (0)
