MATLAB Examples

Human Activity Classification based on Smartphone Sensor Signals

Cluster Analysis

Contents

Unsupervised Learning

In this section we implement an unsupervised learning approach (the outputs are not known) using the 'kmeans' algorithm.

Clear all variables that are not relevant & load pre-saved variables

% Clear the workspace and command window
clear; clc

% Load set of feature vectors (feat) and cell array of feature names
% (featlabels)
load('Data\Prepared_iPhone_32\BufferFeatures60.mat')

% Run parallel pool
p = gcp('nocreate');
if isempty(p)
    p = parpool('local');
end

First, evaluate the optimal number of clusters using the silhouette criterion

clust_opts = statset('UseParallel',1);
my_kmeans = @(X,K)(kmeans(X, K, 'emptyaction','singleton','distance', 'cityblock',...
    'replicates', 4,'options',clust_opts));

maxClust = 5;
which_clusters = evalclusters(feat,my_kmeans,'silhouette','KList',(1:maxClust));
disp(which_clusters);
nClust = which_clusters.OptimalK;
  SilhouetteEvaluation with properties:

    NumObservations: 24075
         InspectedK: [1 2 3 4 5]
    CriterionValues: [NaN 0.8659 0.8719 0.6665 0.6698]
           OptimalK: 3

K-Means Clustering

Perform k-means clustering to partition the data set into clusters. The city-block distance metric is used, and the clustering is repeated 4 times with different initial centroid guesses (to reduce the risk of a local-minimum solution).

tic
[kidx,C] = my_kmeans(feat, nClust);
toc

% Visualize cluster identity using a colored pair-scatter plot of the
% first 4 principal component (PCA) scores
[~,score,~] = pca(feat);
pairscatterplain(score(:,1:4), featlabels(1:4), kidx);
Elapsed time is 4.038244 seconds.

Cluster Evaluation - Silhouette Value

Another metric for evaluating clusters is the silhouette value. This is a measure of how close each point in one cluster is to points in the neighboring clusters. This measure ranges from +1, indicating points that are very distant from neighboring clusters, through 0, indicating points that are not distinctly in one cluster or another, to -1, indicating points that are probably assigned to the wrong cluster.

The silhouette plot shows the sorted silhouette values for the points in each cluster, grouped by cluster. We also compute the average silhouette value over all points.
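For a single observation i, the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the smallest mean distance from point i to the points of any other cluster. The following minimal sketch (not part of the original example) computes this by hand for one observation, using the same city-block distance as the clustering:

% Illustrative only: silhouette value of a single observation, computed
% directly from its definition. The built-in silhouette function does
% this for every point.
i = 1;                                              % pick any observation
d = sum(abs(bsxfun(@minus, feat, feat(i,:))), 2);   % city-block distance to all points
own = (kidx == kidx(i));  own(i) = false;           % other members of i's own cluster
a = mean(d(own));                                   % mean intra-cluster distance
b = inf;
for k = setdiff(unique(kidx)', kidx(i))             % nearest neighboring cluster
    b = min(b, mean(d(kidx == k)));
end
s_i = (b - a) / max(a, b)                           % silhouette value, in [-1, 1]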

clusternames = cellstr([repmat('Cluster',nClust,1) num2str((1:nClust)')]);

figure
[sil,~] = silhouette(feat, kidx, 'cityblock');
set(gca,'YTickLabel',clusternames);
title(sprintf('Average silhouette value: %0.2f', mean(sil)));

Find Discriminative Features

It can be helpful to find a small set of features that discriminate strongly between clusters. This reduces the dimensionality of the problem and helps in interpreting the clustering results, as we will see later. To do that, we use a function that computes a two-sample t-test for each feature, for every pair of clusters, and uses the test statistic to rank the features.
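featureScore is a helper function that ships with this example and is not listed here. A plausible minimal sketch of such a helper, assuming it ranks features by the absolute two-sample t-statistic for every cluster pair (the actual implementation may differ):

% Hypothetical sketch of a featureScore-style helper (illustration only).
% For every pair of clusters, compute a two-sample t-statistic per feature
% and keep the indices of the nTop features with the largest |t|.
function [featIdx, pair] = featureScoreSketch(feat, kidx, nTop)
    ks   = unique(kidx);
    pair = nchoosek(ks, 2);                    % all cluster pairs
    featIdx = zeros(size(pair,1), nTop);
    for p = 1:size(pair,1)
        A = feat(kidx == pair(p,1), :);
        B = feat(kidx == pair(p,2), :);
        % Welch-style t-statistic for each feature (column)
        t = (mean(A) - mean(B)) ./ sqrt(var(A)./size(A,1) + var(B)./size(B,1));
        [~, order] = sort(abs(t), 'descend');
        featIdx(p,:) = order(1:nTop);
    end
end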

[featIdx, pair] = featureScore(feat, kidx, 10);
disp('Top features for separation of clusters:');
for i = 1:size(pair,1)
    fprintf('%d and %d: ', pair(i,1), pair(i,2));
    fprintf('%s, ', featlabels{featIdx(i,:)});
    fprintf('\n');
end
Top features for separation of clusters:
1 and 2: BodyAccYCovZeroValue, BodyAccYPowerBand2, BodyAccYRMS, BodyAccYCovFirstValue, BodyAccXCovZeroValue, BodyAccXRMS, BodyAccXPowerBand2, BodyAccZRMS, BodyAccZPowerBand2, BodyAccZCovZeroValue, 
1 and 3: BodyAccYPowerBand2, BodyAccYRMS, BodyAccYCovZeroValue, BodyAccYCovFirstValue, BodyAccYSpectVal6, BodyAccYSpectVal2, BodyAccYSpectVal5, BodyAccYPowerBand1, BodyAccYSpectVal3, BodyAccYSpectVal1, 
2 and 3: BodyAccYPowerBand2, BodyAccYCovZeroValue, BodyAccYRMS, BodyAccYCovFirstValue, BodyAccXCovZeroValue, BodyAccYSpectVal6, BodyAccXRMS, BodyAccXPowerBand2, BodyAccYPowerBand1, BodyAccYSpectVal5, 

Attribute Meaning to Clusters

What do the clusters represent? One way of attributing meaning is to use the discriminative features to characterize each cluster. For example, if one cluster has a particularly high or low average for a feature, a physical interpretation can be attached to that cluster. Here, we calculate the mean value of each highly discriminative feature, grouped by cluster, and present the result as a table and as a heatmap for interpretation.

allF = unique(featIdx(:));
g = grpstats(feat(:,allF),kidx)';
featMeans = table(featlabels(allF), g, 'VariableNames', {'Feature', 'ClusterMeans'});
disp(featMeans)   % tabular view of the per-cluster feature means

heatmap(g,1:nClust,featlabels(allF),'%0.1f','gridlines','-','showAllTicks',true,'Textcolor','w');
colormap copper
colorbar
xlabel Cluster

Build a graph connecting the clusters to the top discriminative features that separate them

[connections,nodeOutmap,nodeFeatmap] = build_connections(pair,featIdx);
ImportantFeat = cellstr(categorical(unique(featIdx),(1:numel(featlabels)),featlabels));
G = table([connections(:,1) connections(:,2)],ones(size(connections,1),1),repmat({'t-test Interaction'},size(connections,1),1),...
    'VariableNames',{'EndNodes' 'Weight' 'Code'});
G = graph(G,table([clusternames;ImportantFeat],'VariableNames',{'Name'}));
colormap hsv
nColors = degree(G);
nSizes = 6*sqrt(nColors-min(nColors)+0.2);
p = plot(G,'MarkerSize',nSizes,'NodeCData',nColors,'EdgeAlpha',0.1,'Layout','force');
set(gca,'XColor','w','YColor','w'); box off
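build_connections is another helper that ships with the example. One plausible reading (only the connections output is sketched here, and the shipped implementation may differ) is that it links both clusters of every pair to each of that pair's top discriminative features, using node indices that match the [clusternames; ImportantFeat] ordering used above:

% Hypothetical sketch of the edge-list construction (illustration only).
% Cluster nodes take indices 1..nClust; feature nodes follow in the order
% of the sorted unique feature indices.
function connections = buildConnectionsSketch(pair, featIdx)
    nClust   = max(pair(:));
    featList = unique(featIdx(:));                 % node ordering for features
    connections = zeros(0, 2);
    for p = 1:size(pair,1)
        for f = featIdx(p,:)
            fNode = nClust + find(featList == f);  % feature node index
            connections = [connections; ...
                pair(p,1) fNode; ...
                pair(p,2) fNode];                  %#ok<AGROW>
        end
    end
end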

Visualize cluster boundaries for two of the top discriminative features

Imp = unique(featIdx(:));
x1 = min(feat(:,Imp(1))):0.1:max(feat(:,Imp(1)));
x2 = min(feat(:,Imp(2))):0.1:max(feat(:,Imp(2)));
[x1G,x2G] = meshgrid(x1,x2);
XGrid = [x1G(:),x2G(:)]; % Defines a fine grid over the two features

% Assign each grid point to the closest centroid. One iteration starting
% from the existing centroids is all that is needed here, so the
% "failed to converge" warning printed below is expected.
idx2Region = kmeans(XGrid,nClust,'MaxIter',1,'Start',C(:,Imp(1:2)));

figure;
gscatter(XGrid(:,1),XGrid(:,2),idx2Region,...
    [0,0.75,0.75;0.75,0,0.75;0.75,0.75,0],'..');
hold on;
plot(feat(:,Imp(1)),feat(:,Imp(2)),'k*','MarkerSize',5);
title 'Acceleration Feature data';
xlabel(featlabels(Imp(1)));
ylabel(featlabels(Imp(2)));
legend([clusternames' 'Data'],'Location','Best');
axis tight
hold off;
Warning: Failed to converge in 1 iterations.