sequentialfs with "dummified" input feature matrix

All my features are categorical and thus I converted each feature into dummy variables using:
So an M x 1 feature vector with N categories (held within a table column) is converted to an M x N matrix in which each row is a 0/1 vector (e.g. [0 0 1]) representing a given category (still held within a table column). This allowed me to convert my table of now-dummified features into a matrix using the code below, in order to train an SVM (via fitcsvm, for example, though I'm actually using libsvm's svmtrain). Each feature is therefore no longer represented by a single column in the matrix; I was told that this is OK and that training performs as intended.
XMatrix = cell2mat(table2cell(dummifiedXTable));
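For illustration, here is a minimal sketch of the dummy encoding described above, using a made-up two-feature dataset. The variable names (colorFeat, sizeFeat, featureOfColumn) are hypothetical and not part of the original code; the sketch assumes the Statistics and Machine Learning Toolbox for dummyvar. It also records which original feature each dummy column came from, which is the bookkeeping that gets lost once everything is one matrix.

```matlab
% Hypothetical categorical features; values are illustrative only.
colorFeat = categorical({'red'; 'green'; 'blue'; 'green'});
sizeFeat  = categorical({'S'; 'L'; 'S'; 'M'});

% dummyvar returns one 0/1 column per category of the input.
dColor = dummyvar(colorFeat);   % 4 x 3 (blue, green, red)
dSize  = dummyvar(sizeFeat);    % 4 x 3 (L, M, S)

% Concatenate into a single numeric matrix for the classifier.
XMatrix = [dColor dSize];       % 4 x 6

% Keep track of which original feature produced each dummy column.
featureOfColumn = [ones(1, size(dColor, 2)), 2 * ones(1, size(dSize, 2))];
```

Each row of dColor and dSize contains exactly one 1, so a single original feature now spans several columns of XMatrix.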
I am now pursuing feature selection using sequentialfs() which iteratively finds the most predictive features as so:
c = cvpartition(Y,'k',5);
opts = statset('display','iter');
inmodel = sequentialfs(@my_fun_lib,dumXMat,dumY,'cv',c,'options',opts);
where my_fun_lib is:
function [ criterion ] = my_fun_lib(trainX,trainY,testX,testY)
bestc = '1';
model = svmtrain(trainY, trainX,['-s 0 -t 0 -c ' bestc]);
criterion = sum(svmpredict(testY, testX, model) ~= testY);
My issue is that in my input feature matrix X a single column no longer represents a whole feature, which is what sequentialfs() expects. I've tried passing sequentialfs() the feature table before converting it into a matrix (while each column still represents a feature) and handling the conversion inside my_fun, but it appears that sequentialfs() only accepts a matrix.
How can I resolve this issue? Many many thanks for your help!

Accepted Answer

Ilya on 17 Aug 2015
To answer this particular question:
"Is there any way instead of selecting category choices like this to select initial features? For example, I have features Weight, Model_Year, Color, Owner_Name, and Owner_Country and I want to know which of these are most predictive, not which of their category choices are."
I don't know why you would want to do this because selecting categories is in principle more useful than selecting predictors. But if you insist on doing that, just dummify variables inside your my_fun_lib. This works as long as all your features are categorical. Here is an example:
%%Form a table with two categorical variables
load fisheriris
T = table(categorical(meas(:,1)),categorical(meas(:,2)));
%%Convert table into matrix
X = zeros(size(T));
for p=1:size(X,2)
    X(:,p) = double(T{:,p});
end
%%Turn into a binary classification problem
y = strcmp(species,'setosa');
%%Run sequentialfs
c = cvpartition(y,'k',5);
opts = statset('display','iter');
inmodel = sequentialfs(@mycrit,X,y,'cv',c,'options',opts);
where mycrit looks like this:
function val = mycrit(Xtrain,Ytrain,Xtest,Ytest)
Ntrain = size(Xtrain,1);
X = dummyvar([Xtrain; Xtest]);
Xtrain = X(1:Ntrain,:);
Xtest = X(Ntrain+1:end,:);
obj = fitcsvm(Xtrain,Ytrain);
val = loss(obj,Xtest,Ytest);
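Because each column of X in this approach still corresponds to one table variable, the logical vector returned by sequentialfs can be mapped straight back to variable names. The following is a small sketch of that step; the variable names given to the table and the value of inmodel are placeholders chosen for illustration, not an actual selection result.

```matlab
% Rebuild the table from the example, with explicit (assumed) variable names.
load fisheriris
T = table(categorical(meas(:,1)), categorical(meas(:,2)), ...
    'VariableNames', {'SepalLength', 'SepalWidth'});

% Placeholder standing in for the output of sequentialfs (not a real result).
inmodel = [true false];

% Logical indexing into the variable names gives the selected features.
selectedNames = T.Properties.VariableNames(inmodel);
disp(selectedNames)
```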
Melissa McCoy on 24 Aug 2015
Ah, that makes sense, thank you! One last point that's really irking me: why does it select a different feature set each time I run it? Is it because the sets are equally predictive (I get the same leave-one-out cross-validation error with each set), so it arbitrarily chooses one over the other each time it runs? Many thanks!
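One possible source of run-to-run variation, assuming k-fold partitions as in the earlier snippets, is that cvpartition assigns observations to folds at random. A minimal sketch of making that step repeatable is to seed the random number generator before building the partition. The seed value 1 is arbitrary, and this may not be the only cause of the differences described here.

```matlab
load fisheriris
y = strcmp(species, 'setosa');

% Fix the seed (arbitrary value) before creating the partition.
rng(1);
c1 = cvpartition(y, 'k', 5);

% Re-seeding with the same value reproduces the same fold assignment.
rng(1);
c2 = cvpartition(y, 'k', 5);

% Compare the test-set membership of the first fold in both partitions.
sameFolds = isequal(test(c1, 1), test(c2, 1));
```

Passing the same partition object to every sequentialfs call via the 'cv' option keeps the folds identical across runs.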


More Answers (1)

Madhav Rajan on 17 Aug 2015
I understand that you want to perform feature selection using 'sequentialfs' with a 'dummified' input feature matrix. Assuming that you have the Statistics and Machine Learning Toolbox, you can refer to the following example, which uses the sample 'carsmall' dataset available in MATLAB.
The example uses two features. One is the numerical 'Weight' variable, the weight of the cars. The other is the M x 1 'Model_Year' variable, which is dummified into an M x 3 matrix because 'Model_Year' takes three distinct values. The class variable is 'Origin', recoded as '0' if the country is 'USA' and '1' otherwise. 'svmtrain' and 'svmclassify' are used in the my_fun file to train the model and classify the data.
The deviance formula in the example just sums up the misclassified points and is returned by 'my_fun'.
%%load the cars data set
load carsmall;
%%define y, x1, x2
y = categorical(cellstr(Origin),{'USA','France', 'Germany', 'Italy', 'Sweden', 'Japan'}, {'0', '1', '1' ,'1' ,'1', '1'} );
x1 = Weight;
x2 =dummyvar(nominal(Model_Year));
%%call sequential fs with the correct parameters
c = cvpartition(y,'k',10);
opts = statset('display','iter');
X = [x1 x2];
[fs,history] = sequentialfs(@my_fun, X,y, 'cv',c,'nullmodel',false,...
%%display the outputs
function [ dev ] = my_fun( XTRAIN,ytrain,XTEST,ytest)
%MY_FUN Summary of this function goes here
% Detailed explanation goes here
%%train the model and test it
SVMModel = svmtrain(XTRAIN,ytrain);
group = svmclassify(SVMModel, XTEST);
%%compute deviance
dev = sum(~strcmp(ytest, group));
Looking at your script, the 'svmtrain' function appears to be called with incorrect parameters; you can refer to the documentation for more details. The input to 'my_fun.m' has to be a matrix, so any table data representing the feature variables must be converted to a matrix. You can refer to the documentation of the 'sequentialfs' function for more details on defining the criterion function 'my_fun.m'.
Hope this helps
  1 Comment
Melissa McCoy on 17 Aug 2015
Edited: Melissa McCoy on 17 Aug 2015
Thank you for this thorough answer!
Regarding the incorrect parameters to svmtrain: my sincere apologies, I did not specify that I'm using the libsvm library to implement the SVM (which has its own svmtrain() and svmpredict() functions), and I've verified that I've implemented this part correctly. I've also edited my statement of the my_fun.m function, which I believe achieves the same output as yours.
Regarding feature selection: in your implementation, X will have 4 columns (the first representing the Weight feature and the other 3 representing the category choices of Model_Year). Therefore, sequentialfs() will return its selected column numbers, say 1 and 3. A few questions that I would love answers to:
  • How do you know which category choice of Model_Year column 3 refers to? I've tracked this by storing the values returned by nominal(featureVector) but I'm not certain that the order given here is the order of the dummy columns.
  • If I've divided my dataset into a validation set, training set, and test set that are all exclusive, would I perform this feature selection on my entire dataset, my validation dataset, or something else?
  • If the feature selection returned 1 and 3 and say 3 represents Ford1990, then you could build your SVM assuming there are just 2 features: "Weight" and "Is it a Ford1990 or not?", correct?
  • Is there any way instead of selecting category choices like this to select initial features? For example, I have features Weight, Model_Year, Color, Owner_Name, and Owner_Country and I want to know which of these are most predictive, not which of their category choices are.
  • Finally, each time I run sequentialfs(), I get different feature selections. According to this post, this is to be expected because some randomness is inherent in the process. I followed its suggestions, but they didn't fix the issue. Have you come across the same issue?
Thanks so much for the help!!
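On the first question in the list above, one way to check the column order empirically is to compare each dummy column against the category list. The sketch below uses categorical and categories rather than the nominal call from the answer, which is an assumption on my part, and the choice k = 2 is arbitrary.

```matlab
load carsmall

% Model_Year as a categorical variable and its ordered category names.
g = categorical(Model_Year);
cats = categories(g);

% One dummy column per category.
D = dummyvar(g);

% Check that dummy column k is 1 exactly where the car is in category k.
k = 2;
matches = isequal(logical(D(:, k)), g == cats{k});
```

If matches is true for every k, then column 3 of X = [x1 x2] (that is, column 2 of the dummy block) corresponds to cats{2}.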

