Optimizing Clustering of Array Elements

5 views (last 30 days)
The Task: a 1-D column array of coordinates exists. Most of those coordinates repeat more than once.
1. look for the most repeated element; 2. assign a cluster id to it; 3. include coordinates that lie in +/-10 units to the same cluster 4. remove clustered values from the array 5. proceed with the second most repeated element and loop the same steps as above 6. do the procedure until coordinates are not repeated
Here is a sample code:
function [clusterArray,frequencyArray] = ClusterWithModeSeed(inputArray)
% Cluster vector elements to seeds determined by elements frequency
% element should be repeated at least two times to be considered in
% a cluster
% Function Input
inputSize=size(inputArray,1);
index=(1:inputSize)';
clusterArray=zeros(inputSize,1);
cluster=0;
frequencyArray=zeros(inputSize,1);
freq=2;
while freq>1
% Update cluster ID
cluster=cluster+1;
% Calculate Mode And Frequency
[mfv,freq]=mode(inputArray);
% All Elements In +/-10 range of the most frequent value
tfCluster=inputArray>=mfv-10&inputArray<=mfv+10;
% Set Cluster % Frequency Outputs
clusterArray(index(tfCluster))=cluster;
frequencyArray(cluster)=sum(tfCluster);
% Update Index And Data
inputArray(tfCluster)=[];
index(tfCluster)=[];
end
% Remove Empty Frequency
frequencyArray(frequencyArray==0)=[];
end
Example Input Array : ans =
193974
2140429
2140432
2140437
2140442
2249750
2253106
2253106
2269479
2269980
2276359
2276359
2276365
2276365
2276359
2276359
2276365
2276359
2276359
2276359
2276359
2276359
2278750
2282743
2282756
How can I optimize the above algorithm? It uses arrays with around 1E6 Elements and takes around 15-20sec. Any comments are appreciated, thank you for your time!

Accepted Answer

Sean de Wolski
Sean de Wolski on 21 Mar 2012
Don't do this:
% Update Index And Data
inputArray(tfCluster)=[];
index(tfCluster)=[];
Changing the size of a big array takes time. Lots of it. So save the parts you need to get rid of and get rid of them at the end. Or set them to nan or something else your program can ignore.

More Answers (1)

go9kata
go9kata on 22 Mar 2012
Thank you for the reply. I took your advice and tried to change the strategy. I came up with the following function but it is even slower than before:
% Function Input
inputSize=size(inputArray,1);
index=(1:inputSize)';
clusterArray=zeros(inputSize,1);
% Get Only Unique Elements And Their Frequency
uniqueArray=unique(inputArray);
frequencyArray=histc(inputArray,uniqueArray);
% Sort Unique Elements Based On Frequency
[frequencyArray,sortIndex]=sort(frequencyArray,'descend');
uniqueArray=uniqueArray(sortIndex);
usedArray=true(size(uniqueArray,1),1);
% Maximum Possible Number Of Loops
nLoops=sum(frequencyArray>1);
% Allocate Some Variables
clusterFrequency=zeros(nLoops,1);
cluster=0;
pos=1;
while pos<nLoops
% Current Position
pos=find(usedArray,1,'first');
% Current Most Frequent Value
mfv=uniqueArray(pos);
% Check If It Is Clustered
if all(clusterArray(inputArray==mfv)==0)
% Update Cluster Id
cluster=cluster+1;
% Assign Value To Cluster Array
tfCluster=inputArray>=mfv-10&inputArray<=mfv+10;
clusterArray(tfCluster)=cluster;
clusterFrequency(cluster)=sum(tfCluster);
% Update Which Of The Unique Values Were Already Clustered
tfUsed=uniqueArray>=mfv-10&uniqueArray<=mfv+10;
usedArray(tfUsed)=false;
end
end
clusterFrequency(clusterFrequency==0)=[];
I know that the "if" clause is making a delay. But I could not really avoided it. Any suggestions are appreciated!

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!