Why do different runs of sequentialfs give different lists of features?

I am running sequentialfs for feature selection. Different runs produce different lists of selected features. I have 11 features in the data set. Should I run sequentialfs multiple times and keep the features that are selected in at least 75% of the runs?

Answers (1)

Walter Roberson on 27 Jul 2015
In the documentation for sequentialfs, notice 'mcreps', the number of Monte Carlo repetitions. Monte Carlo *always* implies randomness.
Now look at 'options' and notice UseSubstreams and Streams, which control which random number generator is used. UseSubstreams is only meaningful when parallel processing is turned on, but Streams is used either way. Clearly, randomness is part of the calculation.
If you want consistent output, initialize your random number generator first.
Also try increasing mcreps to have sequentialfs automatically run multiple repetitions.
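For example, a minimal sketch (X, y, and mycrit are placeholder names, not from your code): seed the global generator once before calling sequentialfs, and repeated runs will use the same partitions.
rng(0,'twister');                        % fixed seed -> reproducible partitions
opts = statset('Display','iter');        % optional: show the selection steps
% With a numeric 'cv', sequentialfs creates the partitions internally,
% so the fixed global stream makes them repeatable across runs.
inmodel = sequentialfs(@mycrit,X,y,'cv',10,'mcreps',5,'options',opts);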
  2 Comments
Melissa McCoy on 16 Aug 2015
Edited: Melissa McCoy on 16 Aug 2015
Thanks for providing this info!
When you say to initialize my random generator, do you mean something like the code below? I've also increased mcreps to 5, but I'm still getting different features returned (it's searching through 345 features over 448 data entries; note the features are dummy variables derived from ~70 features with 4-5 categories each).
c = cvpartition(dumY(:,1),'k',5);                  % 5-fold partition on the response
stream = RandStream('mrg32k3a','Seed',5489);       % fixed-seed random stream
opts = statset('display','iter','Streams',stream); % pass the stream to sequentialfs
inmodel = sequentialfs(@my_fun_lib,XMat,dumY(:,1),'cv',c,'mcreps',5,'options',opts);
Can you advise on my error or on ways to solve the issue? My features do have quite a bit of data that is missing not at random, which I've handled by adding an extra "Unsure" category to each feature, and I'm not sure whether this could be causing the issue.
Many thanks!
Mango Wang on 29 Jun 2019
The original asker may no longer care about the answer, but I'll put my thoughts here for future readers.
It's not the Monte Carlo repetitions that cause the differing results; it's the way cross-validation is used. Cross-validation itself involves randomness, which leads to different results. Because you call cvpartition before creating the RandStream, i.e. you initialize the random generator after the cv object has already been created, sequentialfs returns different results, especially when your code is inside a function.
One way to avoid this is to pass 'cv' as a plain number of folds rather than a cvpartition object, so you don't need to initialize the stream before creating the partition yourself.
Another way is to increase mcreps, which performs Monte Carlo repartitioning based on the cvpartition object.
But I guess a better way is to let it run thousands of times to get a really robust feature subset.
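For example, a sketch of the first kind of fix, reusing the variable names from the earlier comment (my_fun_lib, XMat, dumY); the key point is that the stream must be in place before cvpartition runs:
stream = RandStream('mrg32k3a','Seed',5489);
RandStream.setGlobalStream(stream);               % set BEFORE creating the partition
c = cvpartition(dumY(:,1),'KFold',5);             % folds now come from the fixed stream
opts = statset('Display','iter','Streams',stream);
inmodel = sequentialfs(@my_fun_lib,XMat,dumY(:,1),'cv',c,'mcreps',5,'options',opts);
% Alternative: pass a number of folds and let sequentialfs partition internally:
% inmodel = sequentialfs(@my_fun_lib,XMat,dumY(:,1),'cv',5,'mcreps',5,'options',opts);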
