Code covered by the BSD License  

Highlights from
(simple) Tool for estimating the number of clusters

(simple) Tool for estimating the number of clusters

by

Kaijun Wang (view profile)

 

10 Feb 2007 (Updated )

12 validity indices, illustrate estimation of the number of clusters

daisy(x,vtype,metric)
function result = daisy(x,vtype,metric)

%DAISY returns a matrix containing all the pairwise dissimilarities
%(distances) between observations in the dataset.  The original
%variables may be of mixed types.
%
%The calculation of dissimilarities is explained in:
%   Kaufman, L. and Rousseeuw, P.J. (1990),
%   "Finding groups in data: An introduction to cluster analysis",
%   Wiley-Interscience: New York (Series in Applied Probability and
%   Statistics), ISBN 0-471-87876-6.
%
% Required input arguments:
%   x : Data matrix (rows = observations, columns = variables)
%   vtype : Variable type vector (length equals number of variables)
%       Possible values are 1  Asymmetric binary variable (0/1)
%                           2  Nominal variable (includes symmetric binary)
%                           3  Ordinal variable
%                           4  Interval variable
%
% Optional input arguments:
%   metric : Metric to be used (default euclidian (eucli) or mixed (mixed))
%       Possible values are 'eucli' Euclidian (all interval variables)
%                           'manha' Manhattan
%                           'mixed' Mixed (not all interval variables)
%
% I/O:
%   result=daisy(x,vtype,'eucli')
%
% Example (subtracted from the referenced book)
%   load flower.mat
%   result=daisy(flower,[2 2 1 2 3 3 4]);
%
% The output of DAISY is a structure containing:
%   result.disv       : dissimilarities (read row by row from the
%                       lower dissimilarity matrix)
%   result.metric     : used metric
%   result.number     : number of observations
%
% This function is part of LIBRA: the Matlab Library for Robust Analysis,
% available at:
%              http://wis.kuleuven.be/stat/robust.html
%
% Written by Guy Brys and Wai Yan Kong (May 2006)

%Checking and filling in the inputs
if (nargin<2)
    error('Two input arguments required')
elseif (nargin<3)
    if (sum(vtype)~=4*size(x,2))
        metric = 'mixed';
        metr = 0;
    else
        metric = 'eucli';
        metr = 1;
    end
elseif (nargin==3)
    if strcmp(metric,'eucli')
        metr=1;
    elseif strcmp(metric,'manha')
        metr=2;
    elseif strcmp(metric,'mixed')
        metr=0;
    end
end

%Standardizing in case of mixed metric
if (sum(vtype)~=4*size(x,2))
    colmin = min(x);
    colextr = max(x)-colmin;
    x = (x - repmat(colmin,size(x,1),1))./repmat(colextr,size(x,1),1);
end

%Replacement of missing values
jtmd = repmat(1,1,size(x,2))-2*(sum(isnan(x))>0);
valmisdat = min(min(x))-0.5;
x(isnan(x)) = valmisdat;
valmd = repmat(valmisdat,1,size(x,2));

%Actual calculations
disv=daisyc(size(x,1),size(x,2),x,valmd,jtmd,vtype,metr);

%Putting things together
result = struct('disv',disv,'metric',metric,'number',size(x,1));


Contact us