
Hello,

Long post, please bear with me

I have a MATLAB dataset (dataset.mat) of size 280×3. The last column holds the labels. There are 3 classes in total (1, 2 and 3). I am implementing KNN on this dataset. Basically, I want to calculate the classification error, and the mean and variance of the classification error over multiple (random, but even) splits. From the plot I want to determine how the value of k affects the mean and the variance of the classification error. Now, I understand the concept of bias and variance. I also know that as k increases, the bias will increase and the variance will decrease. When k = 1 the bias will be 0; however, new data (in the test set) has a higher chance of being misclassified, which causes high variance. But the variance isn't decreasing in my plot (please see the attachment).

My code looks like this:

%% Loading the dataset

clear all

clc

load('dataset.mat');

%% Calculating the mean, variance and classification error for multiple splits

m = []; % empty list to store the mean of the classification error

variance = []; % empty list to store the variance of the classification error

error = []; % empty list to store the classification error

for k= 1:20 % different k values

error = [];

for j= 1:10 % This for loop is for random split (note: each time it is split evenly i.e. 50% into a training set and rest in a test set).

% dataset is split evenly (i.e. 50%), but randomly in to a training set and a test set all 10 times

N = size(knn_samples,1);

idx = randperm(N);

train = knn_samples(idx(1:round(N*0.5)),:);

test = knn_samples(idx(round(N*0.5)+1:end),:);

X_train = train(:,1:2); % size 140*2

y_train = train(:,3); % size 140*1

X_test = test(:,1:2); % size 140*2

y_test = test(:,3); % size 140*1

Model = fitcknn(X_train,y_train,'NumNeighbors',k,'Standardize',1); % KNN model

rloss = resubLoss(Model); % the classification loss by resubstitution

[label_test,score_test,cost_test] = predict(Model,X_test);

L = loss(Model,X_test,y_test); %how well the model classifies the data

C_test = confusionmat(y_test,label_test); % confusion matrix

offIdx = find(~eye(size(C_test))); % linear indices of the off-diagonal entries of the confusion matrix, i.e. the misclassifications

off_diag = sum(C_test(offIdx)); % total count of off-diagonal entries

accuracy = sum(diag(C_test)/sum(C_test(:)));

errorClass = sum(label_test ~= y_test)/length(y_test);

error = [error, errorClass]; % classification error

end

m = [m, mean(error)]; %mean of the classification error

variance = [variance, var(error)]; % variance of the classification error

end

figure(1)

hold on

colormat1 = y_test;

scatter(X_test(:, 1), X_test(:, 2), [], colormat1);

l = (label_test ~= y_test); % specify wrong predictions

colormat2 = label_test(l);

mkr = 'x';

scatter(X_test(l, 1), X_test(l, 2), [], colormat2, mkr); % mark the wrong predictions

k = 1:20;

figure(2)

plot(k, m, 'b')

xlabel('K values')

ylabel('Mean')

title('Mean of the classification error') % over multiple splits

figure(3)

plot(k, variance, 'k') % 'variance' is the vector filled in the loop above

xlabel('K values')

ylabel('Variance')

title('Variance of the classification error')

Maybe there is a more compact way of writing this code, but I am a beginner. This could be a very basic question, but I am unable to figure it out. I looked online for a solution but didn't find anything. Almost every site talks about the bias-variance trade-off, but I didn't find any code example or an explanation of why the variance could increase with increasing k. Maybe there is a small glitch in the code that I am unable to spot. I have given up on finding a solution on my own, hence I am asking the MATLAB community. You can also suggest a better way to write this code, or any link that could point me to a solution.

Note: Please also have a look at the variance values. Are they too small? (They are in the 10^-3 range.)

Thank you very much

llueg
on 24 Jul 2019


Ganesh Regoti
on 29 Jul 2019

Edited: Ganesh Regoti
on 29 Jul 2019

In KNN classification, the variance need not decrease as K increases. Usually the curve is U-shaped, and we look for the optimal point.
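One way to check the shape of the curve is to repeat the random split many more times than 10, since a sample variance computed from only 10 values is itself a noisy estimate. The following is a minimal sketch, not the original poster's code: it uses synthetic three-class data as a stand-in for dataset.mat (the class centres, class sizes, seed and nRepeats are all assumptions made for illustration), and it assumes the Statistics and Machine Learning Toolbox is available for fitcknn.

```matlab
% Sketch only: synthetic stand-in for dataset.mat (280 rows, 2 features, 3 classes).
rng(0);                                  % fixed seed so the run is repeatable
nPerClass = [94 93 93];                  % hypothetical class sizes, summing to 280
centers   = [0 0; 2 0; 1 2];             % hypothetical class centres
X = zeros(0, 2);
y = zeros(0, 1);
for c = 1:3
    X = [X; randn(nPerClass(c), 2) + repmat(centers(c, :), nPerClass(c), 1)]; %#ok<AGROW>
    y = [y; c * ones(nPerClass(c), 1)];                                        %#ok<AGROW>
end

kValues  = 1:20;
nRepeats = 100;                          % assumption: many more splits than 10
N        = size(X, 1);
nTrain   = round(0.5 * N);
errs     = zeros(nRepeats, numel(kValues));   % one row per split, one column per k

for r = 1:nRepeats
    perm     = randperm(N);              % one random 50/50 split per repeat
    trainIdx = perm(1:nTrain);
    testIdx  = perm(nTrain+1:end);
    for i = 1:numel(kValues)
        mdl = fitcknn(X(trainIdx, :), y(trainIdx), ...
                      'NumNeighbors', kValues(i), 'Standardize', true);
        labels = predict(mdl, X(testIdx, :));
        errs(r, i) = mean(labels ~= y(testIdx));   % fraction misclassified
    end
end

meanErr = mean(errs, 1);                 % mean test error for each k
varErr  = var(errs, 0, 1);               % variance of test error across splits, for each k
```

Each split is shared by all values of k, so differences between k values are not mixed up with differences between splits. To try this on the real data, X and y could be replaced with knn_samples(:,1:2) and knn_samples(:,3), and meanErr and varErr plotted against kValues.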

Some predictors may contribute more to the classification than others. Depending on how those highly contributing predictors behave:

Constant: there will not be much difference in the variance graph over the entire data set.

Values vary and reach an optimum at a certain point: the variance varies accordingly (probably decreasing as K increases), but once the optimal point is reached it might start increasing.

So I think that in your case the optimum point is reached along the way, and continuing past it leads to an increase in variance.
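On the side question about magnitude: a variance in the 10^-3 range is not surprising for an error rate measured on 140 test points. As a rough check, assume each test point is misclassified independently with the same probability p (an idealisation that ignores variation coming from the training set). The error rate then has variance p(1-p)/n. The value of p below is hypothetical and only meant to show the order of magnitude.

```matlab
% Rough order-of-magnitude check under a simple binomial assumption.
nTest = 140;                      % test-set size from the question (half of 280)
p     = 0.10;                     % hypothetical true error rate, for illustration only

binomVar = p * (1 - p) / nTest;   % variance of the error rate under the binomial model
binomStd = sqrt(binomVar);        % corresponding standard deviation

fprintf('variance = %.2e, std = %.3f\n', binomVar, binomStd);
```

With these numbers the variance is about 6.4e-4, i.e. a standard deviation of roughly 2.5 percentage points, so values of that order in the plot do not by themselves indicate a bug.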
