How do clustering algorithms handle non-numeric or categorical data and is it possible to assign weights to individual features (columns in the data) during clustering?

Do the k-means and hierarchical clustering algorithms handle non-numeric data?
If not, is there any way of handling categorical data in clustering?
Also, is it possible to assign feature weights in hierarchical clustering / k-means clustering?

Accepted Answer

MathWorks Support Team on 19 Nov 2019
1) Do the k-means and hierarchical clustering algorithms handle non-numeric data? If not, is there any way of handling categorical data in clustering?
Neither 'kmeans' nor 'linkage' accepts non-numeric features as input, so you need to convert categorical features to a numeric representation first.
If you call these functions on categorical features, MATLAB throws an error. Consider the following example:
>> X = categorical({'a'; 'b'; 'c'; 'a'});   % illustrative categorical input
>> Z = linkage(X);
Error using internal.stats.linkagemex
Function linkagemex only supports input of class 'double' or 'single'.
Error in linkage (line 259)
Z = internal.stats.linkagemex(Y,method,pdistArg, memEff);
Here, the error clearly indicates that the hierarchical clustering algorithm can only accept numeric data (i.e., 'double' or 'single' data types).
The 'kmeans' function gives a similar error. For example, with a categorical input 'X', '>> idx = kmeans(X,3)' produces 'Error using kmeans (line 166) Invalid data type. The first argument to KMEANS must be a real array.'
To use categorical features for clustering, convert the categories into a numeric type (say 'double'). The distance function that defines the dissimilarity of the data then operates on that 'double' representation of the categorical data. Please take a look at the following link for a descriptive example:
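As a minimal sketch of the conversion described above (not necessarily the approach in the linked example), the snippet below encodes one categorical feature numerically and then clusters. The variables 'color' and 'height' are made-up illustration data, and the choice between integer codes and dummy variables is an assumption that depends on your data.

```matlab
% Hypothetical data: one categorical feature and one numeric feature.
color  = categorical({'red'; 'green'; 'blue'; 'green'; 'red'; 'blue'});
height = [1.0; 2.1; 3.2; 2.0; 1.1; 3.0];

% Option A: integer category codes. This implies an ordering and spacing
% between categories, which may not be meaningful for nominal data.
colorCodes = double(color);

% Option B: dummy (one-hot) encoding, one 0/1 column per category.
colorDummy = dummyvar(color);

% Build an all-numeric feature matrix and cluster it.
X   = [colorDummy, height];
Z   = linkage(X, 'average');        % hierarchical clustering now accepts X
idx = cluster(Z, 'maxclust', 3);    % cut the tree into at most 3 clusters
```

With dummy variables, every pair of distinct categories is equally far apart, which is usually the safer default for unordered categories.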
2) Is it possible to assign feature weights in hierarchical clustering / k-means clustering?
There is no built-in option for assigning feature weights in any of the clustering algorithms. However, you can use 'kmedoids' clustering as an alternative and define a custom 'Distance' function in which you weight the input features as required. Please refer to the following link for an example of specifying the 'Distance' property for k-medoids clustering:
Here, you will need a custom pairwise distance function of the kind that 'pdist' accepts. Please refer to the following link for an example of defining a custom distance function for 'pdist':
You can define your own MATLAB function like the 'naneucdist' function defined in the above link and add weights to the features as per your requirements.
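A minimal sketch of such a weighted distance function is shown below. It assumes the custom-distance convention used by 'pdist' and 'kmedoids', where the function receives one observation 'ZI' (1-by-n) and a block of observations 'ZJ' (m-by-n) and returns an m-by-1 vector of distances. The data 'X', the weights 'w', and the name 'weightedEuc' are hypothetical and only for illustration.

```matlab
rng(0);                                   % reproducible toy data
X = [randn(20,3); randn(20,3) + 4];       % hypothetical 40-by-3 data set
w = [1 2 3];                              % hypothetical feature weights

% Weighted Euclidean distance: sqrt(sum_k w(k) * (ZI(k) - ZJ(:,k)).^2).
% ZJ - ZI relies on implicit expansion (R2016b or later).
weightedEuc = @(ZI, ZJ) sqrt(((ZJ - ZI).^2) * w(:));

% k-medoids with the custom distance.
idx = kmedoids(X, 2, 'Distance', weightedEuc);

% The same handle can be passed to pdist, and the resulting distance
% vector can be fed to linkage for hierarchical clustering.
D = pdist(X, weightedEuc);
Z = linkage(D, 'average');
```

Here the weights multiply the squared differences, so a feature with weight 3 contributes three times as much to the squared distance as a feature with weight 1.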
Alternatively, if you have numerical features and an array of weights for each of these features, you can simply multiply each feature by its weight. Consider the following example, where we have a dataset 'data' with three features:
>> data = [1:10; 1:10; 1:10]';
>> weights = [1 2 3];
>> weightedData = weights .* data;
Now, you can use the 'weightedData' for clustering as per your requirements.
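One point worth keeping in mind: with Euclidean distance, scaling a feature by a factor multiplies its contribution to the squared distance by the square of that factor. The sketch below checks this numerically. The helper name 'wEuc' is hypothetical, and 'weights .* data' assumes implicit expansion (R2016b or later); on older releases 'bsxfun(@times, data, weights)' does the same thing.

```matlab
data    = [1:10; 1:10; 1:10]';
weights = [1 2 3];
weightedData = weights .* data;           % scale each column by its weight

% Euclidean distances on the scaled data ...
d1 = pdist(weightedData);

% ... match a weighted Euclidean distance on the raw data that uses the
% squared weights.
wEuc = @(ZI, ZJ) sqrt(((ZJ - ZI).^2) * (weights(:).^2));
d2 = pdist(data, wEuc);

maxDiff = max(abs(d1 - d2));              % expected to be zero up to rounding

% The scaled data can be passed directly to the built-in functions.
idx = kmeans(weightedData, 2);
Z   = linkage(weightedData, 'average');
```

If you want a feature to count 'w' times as much in the squared distance, scale that column by 'sqrt(w)' rather than 'w'.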

Release

R2019a