Thread Subject:
clustering 'locally' for large data set?

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 25 May, 2010 22:45:20

Message: 1 of 10

Hi Folks,

I have a list of 10^3 to 10^6 positions that I would like to use for cluster analysis. I have been using the function clusterdata...

T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
   distance_threshold,'criterion','distance');

Not surprisingly, the analysis doesn't work beyond around 10^4 points, I suspect because of memory limitations in calculating pairwise distances between so many points.
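
For scale, if clusterdata builds the full condensed distance vector (as I believe it does, via pdist), that alone is N*(N-1)/2 doubles:

N = 1e5;                       % 10^5 points
GB = 8 * N*(N-1)/2 / 2^30;     % bytes for the condensed distance vector
fprintf('%.1f GB\n', GB)       % ~37 GB, far more RAM than I have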

I am wondering if there is another way to do this, especially considering that I do not need the whole cluster hierarchy, but only 'local clustering' within a radius that would only ever include a very small number of the points in the whole data set.

Any suggestions?

Thanks,
jay

I am running Windows XP and R2010a and have access to the toolboxes.

Subject: clustering 'locally' for large data set?

From: ImageAnalyst

Date: 26 May, 2010 02:50:30

Message: 2 of 10

Assuming your clusters are not themselves all clustered together
(no pun intended), or at least are fairly well represented by the
first 10,000 points, why not just use the first 10,000 points?
Does including points 10,000 through 1,000,000 really improve
your clustering anyway (if you could do it)? Probably not.
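
E.g., either way it is one line (a random sample guards against any
ordering bias in the file):

Xs = X(1:1e4, :);            % just the first 10,000 points
p = randperm(size(X,1));     % or: a random permutation of the row indices
Xs_rand = X(p(1:1e4), :);    % a random 10,000-point sample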

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 26 May, 2010 14:14:04

Message: 3 of 10

Thanks for the tip, ImageAnalyst. If I am unable to find a better solution, I may have to resign myself to tossing most of the data, as heartbreaking as that may be.

I guess I was hoping there might be other algorithms or approaches for efficient 'local' clustering of large data sets. For instance: sort the data into smaller, manageable chunks, process each with clusterdata, and then figure out how to handle the edge points.


ImageAnalyst <imageanalyst@mailinator.com> wrote in message <31092b43-472c-4aa7-9c9a-6ab8f7c69030@l6g2000vbo.googlegroups.com>...
> Assuming your clusters are not themselves all clustered together
> (no pun intended), or at least are fairly well represented by the
> first 10,000 points, why not just use the first 10,000 points?
> Does including points 10,000 through 1,000,000 really improve
> your clustering anyway (if you could do it)? Probably not.

Subject: clustering 'locally' for large data set?

From: Walter Roberson

Date: 27 May, 2010 05:05:33

Message: 4 of 10

jay vaughan wrote:

> I have a list of 10^3 to 10^6 positions that I would like to use for
> cluster analysis. I have been using the function clusterdata...

> T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
> distance_threshold,'criterion','distance');

> Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> suspect because of memory limitations in calculating pairwise distances
> between so many points.

> I am wondering if there is another way to do this, especially
> considering that I do not need the whole cluster hierarchy, but only
> 'local clustering' within a radius that would only ever include a very
> small number of the points in the whole data set.

The following assumes that the data is at least 2-dimensional, but it
could be adapted to the case of 1D positions.

Use histc to bin along Y at a constant width and get the indices. You
will use the bins and related indices to extract subsets of the points.

Start at the top and proceed downwards.

Take a window of data at a time (i.e., extract the subset of points in
that Y interval). Cluster on that subset. Extract all the clusters whose
centers are at least R (the radius) above the bottom of the window and
record those as found clusters.

Move the window down to have at least R overlap with the old window, and
repeat.


If you set your bin size to be R, and use two bins together as your
window, then the clusters to be recorded will be the ones whose center
would fall in the upper bin, and moving the window downward would
correspond to making the lower bin the upper one and taking the next
bin. Since you have already done the binning, this would save you from a
bunch of data extraction and recalculation. If you find that the
overhead of firing up for the clustering is too high and you have enough
free memory, you could use several bins in sequence, saving the clusters
whose centers would be in any but the final one, and re-use just that
final bin as the first bin of the new window.


Why discard the clusters that start below the bottom R? Because if a
cluster starts below the bottom R, its radius R continues on into the
next bin (whose data was not presented for clustering), and so there
might be additional points for that cluster (and the cluster center
might shift). Discarding such a cluster the first time is not a
problem, because on the next iteration all of its points will fall
within the new window (unless the assumption about clustering within a
radius is violated).
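
Something like this untested sketch (assuming 2-D data in X, and using
your distance_threshold as the radius R):

R = distance_threshold;                     % bin height = clustering radius
edges = fliplr(max(X(:,2)) : -R : min(X(:,2)) - R);  % ascending Y bin edges
[~, bin] = histc(X(:,2), edges);            % bin index for every point
labels = zeros(size(X,1), 1);               % 0 = not yet assigned
remaining = 1:size(X,1);
next_label = 1;
for b = numel(edges)-1 : -1 : 1             % walk the two-bin window downward
    idx = remaining(bin(remaining) == b | bin(remaining) == b+1);
    if numel(idx) < 2, continue, end
    T = clusterdata(X(idx,:), 'distance','euclid', 'linkage','single', ...
        'cutoff',R, 'criterion','distance');
    for c = 1:max(T)
        pts = idx(T == c);
        % keep clusters whose center is at least R above the window bottom
        % (and keep everything on the final, lowest window)
        if mean(X(pts,2)) >= edges(b) + R || b == 1
            labels(pts) = next_label;
            next_label = next_label + 1;
            remaining = setdiff(remaining, pts);  % never re-cluster these
        end
    end
end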

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 27 May, 2010 13:05:06

Message: 5 of 10

Walter Roberson <roberson@hushmail.com> wrote in message <y4nLn.32339$wV2.7729@newsfe23.iad>...
> Use histc to bin along Y at a constant width and get the indices. You
> will use the bins and related indices to extract subsets of the points.
> [...rest of the windowing scheme snipped...]

Hi Walter,

that sounds like just what I am looking for. I will give it a try in the next day or two and see how it goes.

thanks,
jay

Subject: clustering 'locally' for large data set?

From: Ting Su

Date: 27 May, 2010 14:50:15

Message: 6 of 10

Jay,
If hierarchical clustering is not necessarily what you need, you can try the
function "kmeans" in the MATLAB Statistics Toolbox. "kmeans" can handle data
sets with a much larger number of observations than the current
implementation of "clusterdata".
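
For example, a minimal call (note that, unlike clusterdata with a cutoff,
you must choose the number of clusters k up front):

k = 100;              % kmeans requires the cluster count as an input
idx = kmeans(X, k);   % idx(i) is the cluster assigned to row i of X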

-Ting

"jay vaughan" <jvaughan5.nospam@gmail.com> wrote in message
news:hthju0$22m$1@fred.mathworks.com...
> Hi Folks,
>
> I have a list of 10^3 to 10^6 positions that I would like to use for
> cluster analysis. I have been using the function clusterdata...
>
> T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
> distance_threshold,'criterion','distance');
>
> Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> suspect because of memory limitations in calculating pairwise distances
> between so many points.
>
> I am wondering if there is another way to do this, especially considering
> that I do not need the whole cluster heierarchy, but only 'local
> clustering' within a radius that would only ever include a very small
> number of points of the whole data.
>
> Any suggestions?
>
> Thanks,
> jay
>
> I am running Windows XP and R2010a and have access to the toolboxes.
>

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 28 May, 2010 16:12:23

Message: 7 of 10

Ting,

thanks for the suggestion. I read up on kmeans clustering, but I don't think it is right for my job: I don't actually know how many clusters I have, and, if I understood the description correctly, kmeans takes the number of clusters as an input.

best regards,
jay

"Ting Su" <Ting.Su@mathworks.com> wrote in message <htm0r8$cq6$1@fred.mathworks.com>...
> Jay,
> If hierarchical clustrering is not necessary what you need, you can try
> function "kmeans" in the MATLAB Statistics toolbox. Function "kmeans" can be
> used for data sets with much larger number of observations than the current
> implementation of "clusterdata".
>
> -Ting
>
> "jay vaughan" <jvaughan5.nospam@gmail.com> wrote in message
> news:hthju0$22m$1@fred.mathworks.com...
> > Hi Folks,
> >
> > I have a list of 10^3 to 10^6 positions that I would like to use for
> > cluster analysis. I have been using the function clusterdata...
> >
> > T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
> > distance_threshold,'criterion','distance');
> >
> > Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> > suspect because of memory limitations in calculating pairwise distances
> > between so many points.
> >
> > I am wondering if there is another way to do this, especially considering
> > that I do not need the whole cluster heierarchy, but only 'local
> > clustering' within a radius that would only ever include a very small
> > number of points of the whole data.
> >
> > Any suggestions?
> >
> > Thanks,
> > jay
> >
> > I am running Windows XP and R2010a and have access to the toolboxes.
> >
>

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 28 May, 2010 16:40:27

Message: 8 of 10

Hi Walter,

thanks again for the tip. Although I am anything but an ace programmer, I managed to implement what you described for 1D slicing or 2D tiling of large numbers of points. Now I can get back to my research.

For ~500 points, slicing or tiling was comparable to, or a little slower than, plain clusterdata. For ~5,000 points, slicing or tiling was ~5x faster than clusterdata. For ~50,000 points, clusterdata failed outright, and slicing was ~2x faster than tiling for some reason. For ~500,000 points, tiling was quite slow (~5 min) but was the only routine that didn't run out of memory. If clusters are 'too large' relative to the window, some points are lost, although these appear as zeros in the cluster_assignments output so the user can be aware of it.

I included a testing routine and the slicing/tiling cluster functions below in case you or others want to check them out. Given the lack of generality (e.g., they are hard-coded for 2-dimensional data and a particular clustering routine), I didn't think they were worthy of the File Exchange.


best regards,
jay

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all

%% input parameters
num_seed_pts = 100;
max_num_pts_per_cluster = 200;
cluster_spread = 0.002;

num_slices_per_dim = 20;
clust_dist_thr = cluster_spread;

%% generate test 2D cluster data and view it
seed_pts = [rand(num_seed_pts,1) rand(num_seed_pts,2)];
num_pts_per_cluster = round(max_num_pts_per_cluster*rand(num_seed_pts,1));
X = [];
ctr = 1;
for k = 1:num_seed_pts
    x = seed_pts(k,1) + cluster_spread*rand(num_pts_per_cluster(k),1);
    y = seed_pts(k,2) + cluster_spread*rand(num_pts_per_cluster(k),1);
    X(ctr:ctr+num_pts_per_cluster(k)-1,:) = [x y];
    ctr = ctr + num_pts_per_cluster(k);
end

figure(1)
plot(X(:,1),X(:,2),'.','markersize',1)
xlim([-0.1 1.1])
ylim([-0.1 1.1])

clear num_seed_pts seed_pts max_num_pts_per_cluster ctr k x y num_pts_per_cluster

%% do the cluster analysis
try % conventional (unsliced/untiled) cluster analysis
    tic
    cluster_assignments1 = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
        clust_dist_thr,'criterion','distance');
    conv_time = toc;
catch
    conv_time = 0;
    cluster_assignments1 = zeros(size(X,1),1);
end

try % 1D sliced cluster analysis
    tic
    cluster_assignments2 = clusterdata_slices(X,num_slices_per_dim,clust_dist_thr);
    clust_1Dslices_time = toc;
catch
    clust_1Dslices_time = 0;
    cluster_assignments2 = zeros(size(X,1),1);
end

try % 2D, tiled cluster analysis
    tic
    cluster_assignments3 = clusterdata_tiles(X,num_slices_per_dim,clust_dist_thr);
    clust_2Dslices_time = toc;
catch
    clust_2Dslices_time = 0;
    cluster_assignments3 = zeros(size(X,1),1);
end

%% results
disp(['number of input points=' num2str(size(X,1)) '; slices per dim=' num2str(num_slices_per_dim)])
disp(['conventional time=' num2str(conv_time) 's; ' ...
    '1D slices time=' num2str(clust_1Dslices_time) 's; ' ...
    '2D tiling time=' num2str(clust_2Dslices_time) 's'])
disp(['clusterdata lost points=' num2str(sum(cluster_assignments1==0)) '; ' ...
    '1D slices lost points=' num2str(sum(cluster_assignments2==0)) '; ' ...
    '2D tiles lost points=' num2str(sum(cluster_assignments3==0))])
disp(' ')


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function cluster_assignments = clusterdata_slices(X,num_slices,clust_dist_thr)

% The function clusterdata_slices can perform cluster analysis on 2D data. It
% is based on clusterdata from the Statistics Toolbox but can handle larger
% input matrices. Note that the entire cluster hierarchy is not returned,
% but instead the bottom-level clusters based on a distance threshold provided
% by the user. num_slices should be chosen large enough that no one slice
% has so many points that it will overload clusterdata, but not so large that a
% slice is smaller than 4x the size of the largest expected cluster.
%
% The data is divided into num_slices slices. All clusters within a hard
% threshold of clust_dist_thr are identified within the slice but
% excluding clusters whose center positions are within 1/4 the height of
% a slice from the leading edge. The routine then advances by half a slice
% in order to pick up clusters from the edge of the previously analyzed
% region as well as clusters within the new region. The process is repeated.
%
% Unassigned points may occur when the cluster size is larger than 1/4 of a
% slice's height. num_unassigned_points = sum(cluster_assignments==0).
%
% SYNTAX: cluster_assignments = clusterdata_slices(X,num_slices,clust_dist_thr)
% X; % input from user, Nx2 array containing 2D position information
% num_slices = 5; % input from user, number of slices
% clust_dist_thr = 0.05;


% code starts here
num_grid_pieces = num_slices*2;
ygrid = linspace(min(X(:,2)),1.1*max(X(:,2)),num_grid_pieces+1);
grid_inc = ygrid(2)-ygrid(1);

remaining_X_inds = 1:size(X,1);
cluster_assignments = zeros(size(X,1),1);
ctr = 1;

% damn all these indices are confusing!
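% Three layers of indexing here: remaining_X_inds holds row numbers of X
% that are still unassigned; remaining_points_inds_in_curr_bin indexes
% into remaining_X_inds; and Xsubset_cluster_inds indexes into the rows
% of Xsubset (equivalently, into T2). Note also that the 1.1*max()
% padding in ygrid above assumes positive coordinates, as in the test data.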
for m = 1:num_grid_pieces-1
    [n,bins_remaining_points] = histc(X(remaining_X_inds,2),ygrid);
    remaining_points_inds_in_curr_bin = find(bins_remaining_points==m | bins_remaining_points==m+1);
    if length(remaining_points_inds_in_curr_bin)>1
        Xsubset = X(remaining_X_inds(remaining_points_inds_in_curr_bin),:);
        T2 = clusterdata(Xsubset,'distance','euclid','linkage','single','cutoff',...
            clust_dist_thr,'criterion','distance');
        inds_to_dump = [];
        num_clusters = max(T2);
        for k = 1:num_clusters
            Xsubset_cluster_inds = find(T2(:)==k);
            
            % if the cluster is not at the edge of the bin, keep the
            % cluster and remove its points from the master list
            if or(mean(Xsubset(Xsubset_cluster_inds,2))<ygrid(m+2)-0.5*grid_inc,m==num_grid_pieces-1)
                clust_inds = remaining_X_inds(remaining_points_inds_in_curr_bin(Xsubset_cluster_inds));
                inds_to_dump = union(inds_to_dump,clust_inds);
                cluster_assignments(clust_inds) = ctr;
                ctr = ctr + 1;
            end
        end
        remaining_X_inds = setdiff(remaining_X_inds,inds_to_dump);
    end
end


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function cluster_assignments = clusterdata_tiles(X,num_slices_per_dim,clust_dist_thr)

% The function clusterdata_tiles can perform cluster analysis on 2D data. It
% is based on clusterdata from the Statistics Toolbox and clusterdata_slices
% but can handle larger input matrices. Note that the entire cluster hierarchy
% is not returned, but instead the bottom-level clusters based on a distance
% threshold provided by the user. num_slices_per_dim should be chosen to be
% large enough that no one slice has so many points that it will overload
% clusterdata, but not so large that half a slice is smaller than 4x the size
% of the largest expected cluster.
%
% The data is divided into num_slices_per_dim^2 tiles. All clusters within a
% hard threshold of clust_dist_thr are identified within the tile, excluding
% clusters whose center positions are within 1/4 of a tile's width/height from
% the leading edges. The routine then advances by half a tile in order to pick
% up clusters from the edge of the previously analyzed region as well as clusters
% within the new region. The process is repeated.
%
% Unassigned points may occur when the cluster size is larger than 1/4 of a
% tile's width/height. num_unassigned_points = sum(cluster_assignments==0).
%
% SYNTAX: cluster_assignments = clusterdata_tiles(X,num_slices_per_dim,clust_dist_thr)
% X; % input from user, Nx2 array containing 2D position information
% num_slices_per_dim = 5; % input from user, number of slices per dimension
% clust_dist_thr = 0.05;


% code starts here
num_grid_pieces = num_slices_per_dim*2;
xgrid = linspace(min(X(:,1)),1.1*max(X(:,1)),num_grid_pieces+1);
grid_inc = xgrid(2)-xgrid(1);

remaining_X_inds = 1:size(X,1);
cluster_assignments = zeros(size(X,1),1);
ctr = 1;

h=waitbar(0,'tile clustering in progress');
% damn all these indices are confusing!
try
    for m = 1:num_grid_pieces-1
        [n,bins_remaining_points] = histc(X(remaining_X_inds,1),xgrid);
        remaining_points_inds_in_curr_bin = find(bins_remaining_points==m | bins_remaining_points==m+1);
        if length(remaining_points_inds_in_curr_bin)>1
            Xsubset = X(remaining_X_inds(remaining_points_inds_in_curr_bin),:);
            T2 = clusterdata_slices(Xsubset,num_slices_per_dim,clust_dist_thr);
            inds_to_dump = [];
            num_clusters = max(T2);
            for k = 1:num_clusters
                Xsubset_cluster_inds = find(T2(:)==k);
                
                % if the cluster is not at the edge of the bin, keep the
                % cluster and remove its points from the master list
                if or(mean(Xsubset(Xsubset_cluster_inds,1))<xgrid(m+2)-0.5*grid_inc,m==num_grid_pieces-1)
                    clust_inds = remaining_X_inds(remaining_points_inds_in_curr_bin(Xsubset_cluster_inds));
                    inds_to_dump = union(inds_to_dump,clust_inds);
                    cluster_assignments(clust_inds) = ctr;
                    ctr = ctr + 1;
                end
            end
            remaining_X_inds = setdiff(remaining_X_inds,inds_to_dump);
        end
        waitbar(m/(num_grid_pieces-1),h,'tile clustering in progress');
    end
catch err
    % close the waitbar before re-raising the error, rather than
    % silently swallowing it
    close(h)
    rethrow(err)
end
close(h)

Subject: clustering 'locally' for large data set?

From: Walter Roberson

Date: 28 May, 2010 17:41:19

Message: 9 of 10

jay vaughan wrote:

Good that you got it going. I haven't had time to analyze the code. The
boundary conditions and the amount you advance the slice or tile sound a
bit wrong to me, but I'd have to think about it more.

> seed_pts = [rand(num_seed_pts,1) rand(num_seed_pts,2)];

As a trivial note compared to what you have done: the above line is
equivalent to

seed_pts = rand(num_seed_pts, 3);

which would be more efficient and less likely to run out of memory.

Subject: clustering 'locally' for large data set?

From: jay vaughan

Date: 28 May, 2010 18:35:24

Message: 10 of 10

Walter Roberson <roberson@hushmail.com> wrote in message <4fTLn.22587$Gx2.3077@newsfe20.iad>...
> jay vaughan wrote:
>
> Good that you got it going. I haven't had time to analyze the code. The
> boundary conditions and the amount you advance the slice or tile sound a
> bit wrong to me, but I'd have to think about it more.

Yeah, I was lazy and just set the boundary to be 1/4 of the slice thickness and advanced by 1/2 the slice thickness. I agree it's not optimized, especially when the slice thickness is much greater than the maximum expected cluster size. In that case I think it would be beneficial to have a smaller boundary and to move the window by more than 1/2 the slice thickness; in the limit of an infinitesimal boundary region, there would be up to a 50% reduction in the number of slices per dimension. I also wondered about advancing the window to the next unclaimed point, or something like that, but saw a few issues and just went for the easier-to-code route.

best regards,
jay
