|
Hi all,
I'm looking for a little help optimizing some code. I'm
essentially implementing a K-nearest Neighbor Algorithm and
trying to find the Optimal K and Number of Binary Features.
I rank the features by correlation first but for this
example I'm just using random values. The trouble I'm
coming to is when I have multiple vectors with the same
pairwise distance I want to consider them as 1-Neighbor. I
use accumarry to accomplish this but the memory requirement
demolishes even my high-end computer. If anyone can see an
alternate implementation I would appreciate the help.
%normally FEATURES and CLASS are provided and are not random
%but for this example I'm using random data
%400 Samples with 200 observations each
FEATURES=rand(400,200)>0.5;
RESP=rand(400,1)>0.5; %randomly assign each to a CLASS
%Determine which pairwise comparisons are same-class or
different class
CLASS_MASK=~xor(repmat(RESP,[1
length(RESP)]),repmat(RESP',[length(RESP) 1]));
%Calculate the distance between each vector
BOTH_dist=xor(repmat(permute(FEATURES,[3 2
1]),[length(RESP) 1 1]),repmat(FEATURES,[1 1 length(RESP)]));
%Give SAME class comparisons POSITIVE distance and OPOSITE
class NEGATIVE distance.
BOTH_dist=BOTH_dist.*(repmat(permute(CLASS_MASK,[1 3 2]),[1
size(FEATURES,2) 1])-repmat(permute(~CLASS_MASK,[1 3 2]),[1
size(FEATURES,2) 1]));
%Use cumsum to determine the distance up-to and including
%the FEATURE in Y
BOTH_dist=cumsum(BOTH_dist,2);
%Assign INF distance for the SELF-CAMPARISONS
BOTH_dist=permute(BOTH_dist,[1 3
2])+repmat(diag(Inf(length(RESP),1),0),[1 1 size(FEATURES,2)]);
%Sort by the magnitude (because I've assigned negative
%distances) to determine the 'NearestNeighbors'
[junk INDS]=sort(abs(BOTH_dist),2);
%A helper index setup to reformat BOTH_dist
row_mask=repmat(repmat((1:length(RESP))',[1
length(RESP)]),[1 1 size(junk,3)]);
face_mask=repmat(permute(repmat(1:size(FEATURES,2),[length(RESP)
1]),[3 1 2]),[length(RESP) 1 1]);
L_inds=sub2ind(size(junk),row_mask(:),INDS(:),face_mask(:));
%Now BOTH_dist is sorted by absolute magnitude
BOTH_dist=reshape(BOTH_dist(L_inds),size(junk));
%find the total distance up-to and including the
%'NearestNeighbor' in Y
BOTH_dist=cumsum(BOTH_dist,2);
%If I stop here then BOTH_dist has the summed distance
between up-to K-NearestNeighbors with up-to and inlcuding
feature F for sample M in at BOTH_dist(M,K,F). However
because these are Binary distances I have the problem where
many Samples have the same distance. So the K is actually
incorrect, it should consider things with the same distance
as 1-K. To solve this I use the next few lines ... which
are where the memory hogging comes in so BEWARE when running.
%I check which centers have increased the distance and then
cumsum along that dimension.
NEIGHBORS=cumsum(cat(2,ones(length(RESP),1,size(FEATURES,2)),diff(BOTH_dist,[],2)~=0),2);
%Now I need to sum everything with the same Neighbor Value
%if you use new_inds=[row_mask(:) NEIGHBORS(:) face_mask(:)]
then MATLAB chokes
new_inds=zeros(numel(row_mask),3);
new_inds(:,1)=row_mask(:);
new_inds(:,2)=NEIGHBORS(:);
new_inds(:,3)=face_mask(:);
%accumarray will now use the NeighborNumber to index where
to put the value in BOTH_dist. If two samples have the same
NeighborNumber then it will sum those values.
BOTH_dist=accumarray(new_inds,BOTH_dist(:),size(NEIGHBORS),[],NaN);
If anyone can propose another way (that avoids accummarry) I
would greatly appreciate it.
Thanks
|