Why is variance high for high K value in this KNN code?

Question

Vanditha Rao on 19 Jul 2019


https://www.mathworks.com/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code

Edited: Ganesh Regoti on 29 Jul 2019

Hello,

Long post, please bear with me

I have a MATLAB dataset (dataset.mat) of size 280-by-3. The last column holds the labels, and there are 3 classes in total (1, 2 and 3). I am implementing KNN on this dataset. I want to calculate the classification error, and the mean and variance of that error, over multiple random but even splits. From the plots I want to determine how the value of k affects the mean and the variance of the classification error. I understand the concept of bias and variance, and I know that as k increases, the bias increases and the variance decreases. When k = 1 the bias will be 0; however, on new data (the test set) the prediction has a higher chance of being wrong, which causes high variance. But the variance is not decreasing in my plot (please see the attachment).

My code looks like this:

%% Loading the dataset
clear all
clc
load('dataset.mat');
%% Calculating the mean, variance and classification error for multiple splits
m = []; % empty list to store the mean of the classification error
variance = []; % empty list to store the variance of the classification error
error = []; % empty list to store the classification error
for k= 1:20 % different k values
    
    error = [];
    
    for j= 1:10 % This for loop is for random split (note: each time it is split evenly i.e. 50% into a training set and rest in a test set). 
        
        
        % dataset is split evenly (i.e. 50%), but randomly in to a training set and a test set all 10 times
        
        N = size(knn_samples,1);
        idx = randperm(N);
        
        train = knn_samples(idx(1:round(N*0.5)),:);
        test = knn_samples(idx(round(N*0.5)+1:end),:);
        X_train = train(:,1:2); % size 140*2
        y_train = train(:,3); % size 140*1
        X_test = test(:,1:2); % size 140*2
        y_test = test(:,3); % size 140*1
       
        Model = fitcknn(X_train,y_train,'NumNeighbors',k,'Standardize',1); % KNN model
        
        rloss = resubLoss(Model); % the classification loss by resubstitution
        
        [label_test,score_test,cost_test] = predict(Model,X_test);
        L = loss(Model,X_test,y_test); %how well the model classifies the data 
        C_test = confusionmat(y_test,label_test); % confusion matrix 
        idx = find(~eye(size(C_test))); % linear indices of the off-diagonal entries of the confusion matrix, i.e. the misclassified counts
        off_diag = sum(C_test(idx)); %to calculate the total value of off diagonal entries
        accuracy = sum(diag(C_test)/sum(C_test(:)));
        
        errorClass = sum(label_test ~= y_test)/length(y_test);
        error = [error, errorClass]; % classification error
        
    end
    
    m = [m, mean(error)]; %mean of the classification error
    variance = [variance, var(error)]; % variance of the classification error
    
end
figure(1)
hold on
colormat1 = y_test;
scatter(X_test(:, 1), X_test(:, 2), [], colormat1); 
l = (label_test ~= y_test); % specify wrong predictions
colormat2 = label_test(l);
mkr = 'x';
scatter(X_test(l, 1), X_test(l, 2), [], colormat2, mkr); % mark the wrong predictions
k = 1:20;
 
figure(2)
plot(k, m, 'b')
xlabel('K values')
ylabel('Mean')
title('Mean of the classification error') % over multiple splits
figure(3)
plot(k, variance, 'k')
xlabel('K values')
ylabel('Variance')
title('Variance of the classification error')

Maybe there is a more compact way of writing this code, but I am a beginner. This could be a very basic question, but I am unable to figure it out. I looked online for a solution but didn't find anything. Almost every site talks about the bias-variance trade-off, but I didn't find any code example or a reason why the variance could be increasing with increasing k. Maybe there is a small glitch in the code that I am unable to spot. I have given up on finding a solution on my own, hence I am asking the MATLAB community. You can also suggest a better way to write this code, or any link that could give me a solution.

Note: Please also have a look at the variance values. Are they too small (they are in the 10^-3 range)?

Thank you very much

2 Comments

Ganesh Regoti on 24 Jul 2019

Can you provide a section of the dataset to test the model on?

Vanditha Rao on 28 Jul 2019

dataset.mat

@Ganesh Regoti: What do you mean by a section of the dataset? Do you want me to attach the dataset? I have attached it.


Answer 1

llueg on 24 Jul 2019


https://www.mathworks.com/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code#answer_384606

I agree more information on the data would be helpful. Also, since your data set is fairly small, you can probably do more than 10 (maybe a hundred) different splits for each k, just to get a more accurate average. If the current trend is still there, it's probably due to properties specific to your data.
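A minimal sketch of this suggestion, assuming the Statistics and Machine Learning Toolbox is available. Since dataset.mat is not reproduced here, the block builds a synthetic 280-by-3 stand-in (the class centres and class sizes are made up purely for illustration); swap in the real knn_samples to reproduce the plots. The names nSplits, testErr, meanErr and varErr are illustrative and not from the original post. The sketch also reuses each random split for every k, which is a design choice that differs from the original code, where every k gets its own splits.

```matlab
% Sketch: repeat the 50/50 random split many times per k and summarise the test error.
% ASSUMPTION: synthetic stand-in data; replace with knn_samples from dataset.mat.
rng(0);                                   % fix the seed so runs are repeatable
nPerClass = [94 93 93];                   % hypothetical class sizes summing to 280
centres   = [0 0; 2 2; 4 0];              % hypothetical class centres
X = []; y = [];
for c = 1:3
    X = [X; randn(nPerClass(c), 2) + centres(c, :)]; %#ok<AGROW>
    y = [y; c * ones(nPerClass(c), 1)];              %#ok<AGROW>
end
knn_samples = [X, y];                     % same 280-by-3 layout as in the question

kValues = 1:20;
nSplits = 100;                            % more repetitions than the original 10
N       = size(knn_samples, 1);
nTrain  = round(N * 0.5);
testErr = zeros(nSplits, numel(kValues)); % one row per split, one column per k

for s = 1:nSplits
    perm     = randperm(N);               % one random even split, shared by every k
    trainIdx = perm(1:nTrain);
    testIdx  = perm(nTrain+1:end);
    for ki = 1:numel(kValues)
        mdl = fitcknn(knn_samples(trainIdx, 1:2), knn_samples(trainIdx, 3), ...
                      'NumNeighbors', kValues(ki), 'Standardize', true);
        pred = predict(mdl, knn_samples(testIdx, 1:2));
        testErr(s, ki) = mean(pred ~= knn_samples(testIdx, 3)); % misclassification rate
    end
end

meanErr = mean(testErr, 1);               % 1-by-20 mean error per k
varErr  = var(testErr, 0, 1);             % 1-by-20 sample variance per k
```

Plotting meanErr and varErr against kValues would then take the place of figures 2 and 3 in the question. With only 10 splits, each variance value is itself a noisy estimate, so a trend seen there may or may not survive 100 splits.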


Answer 2

Ganesh Regoti on 29 Jul 2019


https://www.mathworks.com/matlabcentral/answers/472553-why-is-variance-high-for-high-k-value-in-this-knn-code#answer_385203

Edited: Ganesh Regoti on 29 Jul 2019

In KNN classification, the variance need not decrease as the value of K increases. Usually the curve is U-shaped, and we look for the optimal point.

There might be certain predictors that contribute more to the classification than others. The behaviour depends on how those highly contributing predictors vary:

Constant: There will not be much difference in the variance graph over the entire data set.

Values vary and reach an optimum at a certain point: The variance also varies accordingly (probably decreasing as K increases), but once the optimal point is reached, it might start increasing.

So I think that in your case the optimum point is reached along the way, and continuing past it leads to an increase in variance.
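As a rough sanity check on the magnitude of the variance values, one can treat each of the 140 test points as an independent Bernoulli trial with error probability p. That is only an approximation, because the random splits draw from the same 280 samples, and the p values below are hypothetical rather than taken from the attached plot. Under that assumption the variance of the observed error fraction is p(1-p)/n:

```matlab
% Rough binomial sanity check for the spread of a test error rate.
% ASSUMPTION: test points are treated as independent Bernoulli trials.
nTest    = 140;                      % test-set size from the question
p        = [0.05 0.10 0.20 0.30];    % hypothetical true error rates
binomVar = p .* (1 - p) ./ nTest;    % variance of the observed error fraction
disp([p; binomVar]);                 % first row: p, second row: variance
```

For these rates the result lies between roughly 3e-4 and 1.5e-3, so values in the 10^-3 range are not unusually small for a test set of this size. The expression also grows with p as long as p stays below 0.5, so if the mean error rises with K, the spread of the error across splits may rise with it.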
