## Feature Selection QUBO (Quadratic Unconstrained Binary Optimization)

**Note**

**Installation Required:** This functionality requires MATLAB Support Package for Quantum Computing.

### Feature Selection Overview

Feature selection is a dimensionality reduction technique used to select a subset of features (predictor variables) that provide the best predictive power in modeling a set of data for classification or regression.

Feature selection can be used to:

- Prevent overfitting — Avoid modeling with an excessive number of features, which makes a model more susceptible to memorizing specific training examples.
- Reduce model size — Increase computational performance with high-dimensional data, or prepare a model for embedded deployment when memory might be limited.
- Improve interpretability — Use fewer features, which might help identify those that affect model behavior.

Based on Mücke, Heese, Müller, Wolter, and Piatkowski [1], you can perform feature selection by using a QUBO model to select relevant predictors. Specifically, consider a set of *N* scalar observations *y*(*i*) associated with *p* predictor variables *x*(*i*,*j*), for *i* = 1 through *N* and *j* = 1 through *p*. The problem is to find a subset of the *p* predictors such that the resulting model gives accurate predictions of the observations.
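The selection task can be restated symbolically. A chosen subset of predictors corresponds to a binary indicator vector, and it is this vector that the QUBO model described in the next section optimizes over:

```latex
% Binary indicator over the p predictors
z \in \{0,1\}^p, \qquad
z_j = 1 \;\Longleftrightarrow\; \text{predictor } j \text{ is selected}.
```

The goal is to pick a *z* with few nonzero entries whose corresponding columns *x*(·,*j*) still predict *y* accurately.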

### QUBO Data Model

As noted in [1], the suggested process for creating a QUBO data model is as follows:

1. Specify the input data as an *N*-by-*p* matrix `X`, where each row is one data point.
2. Specify the response data as an *N*-by-1 vector `Y`, where `Y(i)` is the response to the row `X(i,:)`.
3. Calculate the matrix `R(i,j)` for `i` and `j` from 1 through *p* as the mutual information between columns `i` and `j` of `X`. `R(i,j)` represents the redundancy between pairs of features. The mutual information is a nonnegative quantity that you can estimate by binning the data and counting entries in each bin to estimate the associated probabilities. For more information, see https://en.wikipedia.org/wiki/Mutual_information.
4. Similarly, calculate the vector `J(i)` for `i` from 1 through *p* as the mutual information between column `i` of `X` and the response variable `Y`. The quantity `J(i)` is nonnegative and represents the strength of the coupling between column `i` and the response `Y`.
5. For each value *a* from 0 through 1, define the QUBO as *Q<sub>a</sub>* = (1 – *a*)*R* – *a* diag(*J*).
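The binned mutual-information estimate behind both the redundancy matrix `R` and the relevance vector `J` can be written out explicitly. If the *N* samples of a pair of binned variables are counted into a contingency table with entries *N<sub>ab</sub>*, row sums *N<sub>a</sub>*, and column sums *N<sub>b</sub>*, the plug-in estimate is:

```latex
\hat{I} \;=\; \sum_{\substack{a,\,b \\ N_{ab} > 0}}
  \frac{N_{ab}}{N}\,
  \log\!\frac{N_{ab}/N}{\bigl(N_{a}/N\bigr)\bigl(N_{b}/N\bigr)}
  \;\ge\; 0 .
```

This is exactly the quantity that the `iComputeMIXIXJ` helper function later in this example computes.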

Because each `R` and `J` entry is nonnegative, the minimum of the QUBO problem is at the point [1,1,…,1] for *a* = 1 and the point [0,0,…,0] for *a* = 0. (For *a* = 1, the QUBO reduces to –diag(*J*), so setting every entry of the solution to 1 can only lower the objective; for *a* = 0, it reduces to *R*, whose entries are nonnegative, so the zero vector is optimal.) In [1], Mücke et al. prove that in nontrivial problems, for each *k* from 0 through *p*, there is a value *a* such that the number of nonzero entries in the solution to *Q<sub>a</sub>* is equal to *k*. In other words, for each number of predictors *k*, there is a QUBO problem *Q<sub>a</sub>* whose solution has exactly *k* nonzero entries. These entries represent the data columns in `X` that best match the `Y` data while not matching the other data columns.

### Create Feature Selection Model with Synthetic Data

Create synthetic data for feature selection and then create a QUBO model to select the features. To train the resulting regression model, you need a Statistics and Machine Learning Toolbox™ license.

Create a synthetic data set using the `iSyntheticData1` helper function.

```matlab
[N,p,X,Y] = iSyntheticData1;
```

View the number of data points and predictors.

```matlab
N
```

```
N = 10000
```

```matlab
p
```

```
p = 30
```

Select five features, the number of useful features in this example. In general, you can try to find how many useful features exist by cross-validating models with different numbers of features.

```matlab
K = 5;
```

Create the `R` matrix and `J` vector from the data by using 20 bins, and estimate the mutual information by counting the number of entries in each bin. Use the `iBinXY`, `iComputeMIXX`, and `iComputeMIXY` helper functions to create the matrix and vector.

```matlab
nbins = 20;
[binnedXInfo,binnedYInfo] = iBinXY(X,Y,nbins);
binnedX = binnedXInfo.binned;
nbinsX = binnedXInfo.nbins;
binnedY = binnedYInfo.binned;
nbinsY = binnedYInfo.nbins;
R0 = iComputeMIXX(binnedX,nbinsX);
J = iComputeMIXY(binnedX,binnedY,nbinsX,nbinsY);
% Scale R to make alpha = 0.5 correspond to equal redundancy and
% relevance terms.
R = R0/(K-1);
```

Find the appropriate value of *a* for the QUBO *Q<sub>a</sub>* = (1 – *a*)*R* – *a* diag(*J*). To do so, solve for the number of nonzero elements in the solution by using the `howmany` helper function, and then use `fzero` to set that number equal to `K`.

```matlab
fun = @(alpha)howmany(alpha,R,J) - K;
alphasol = fzero(fun,[0 1]);
[~,xsol] = howmany(alphasol,R,J);
```
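In effect, `fzero` finds a root in [0, 1] of the scalar equation below, where *n*(*a*) counts the features selected at weight *a*:

```latex
n(a) - K = 0, \qquad
n(a) \;=\; \Bigl\| \operatorname*{arg\,min}_{x \in \{0,1\}^p} \; x^{\mathsf{T}} Q_a\, x \Bigr\|_0 .
```

At the endpoints, *n*(0) = 0 and *n*(1) = *p*, so `fun` changes sign across the bracketing interval `[0 1]`, and the existence result from [1] guarantees that some *a* attains *n*(*a*) = *K*.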

View the selected predictors.

```matlab
find(xsol.BestX)
```

```
ans = 6 8 9 11 13
```

Train a regression tree, first using the full set of 30 predictors and then using the selected set of five predictors. Specify a common holdout value of 0.2, and compare the cross-validation losses. First train the regression tree using all the predictors. Training a regression tree requires a Statistics and Machine Learning Toolbox license.

```matlab
rng default
mdl = fitrtree(X,Y,CrossVal="on",Holdout=0.2);
kfoldLoss(mdl)
```

```
ans = 0.0041
```

```matlab
rng default
X2 = X(:,find(xsol.BestX));
mdl2 = fitrtree(X2,Y,CrossVal="on",Holdout=0.2);
kfoldLoss(mdl2)
```

```
ans = 0.0032
```

Although both regression trees have a low loss value, the regression tree that uses only five of the 30 predictors has a lower loss value.

### Helper Functions

This code creates the `iSyntheticData1` helper function.

```matlab
function [N,p,X,y] = iSyntheticData1
rng default
N = 10000;
p = 30;
useful = [6,8,9,11,13];
C = randn(p,p);
R = corrcov(C'*C);
X = mvnrnd(zeros(p,1),R,N);
% Make features 15 to 19 highly correlated with useful features:
%   15 -> 6
%   16 -> 8
%   17 -> 9
%   18 -> 11
%   19 -> 13
corrStd = 0.1;
X(:,15:19) = X(:,useful) + corrStd*randn(N,5);
noiseStd = 0.1;
t = 0.5*cos(X(:,11)) + sin(X(:,9).*X(:,8)) + ...
    0.5*X(:,13).*X(:,6) + noiseStd*randn(N,1);
y = rescale(t,0,1);
X = zscore(X);
end
```

This code creates the `iBinXY` helper function. Note that this helper function uses the `iBinPredictors` helper function.

```matlab
function [binnedXInfo,binnedYInfo] = iBinXY(X,Y,nbins)
binnedXInfo = iBinPredictors(X,nbins);
binnedYInfo = iBinPredictors(Y,nbins);
end
```

This code creates the `iComputeMIXX` helper function. Note that this helper function uses the `iComputeMIXIXJ` helper function.

```matlab
function Q = iComputeMIXX(binnedX,nbinsX)
p = size(binnedX,2);
Q = zeros(p,p);
for i = 1:p
    for j = i+1:p
        Xi = binnedX(:,i);
        Xj = binnedX(:,j);
        nbinsXi = nbinsX(i);
        nbinsXj = nbinsX(j);
        Q(i,j) = iComputeMIXIXJ(Xi,Xj,nbinsXi,nbinsXj);
    end
end
Q = Q + Q';
end
```

This code creates the `iComputeMIXIXJ` helper function.

```matlab
function mi = iComputeMIXIXJ(Xi,Xj,nbinsXi,nbinsXj)
N = size(Xi,1);
NXY = zeros(nbinsXi,nbinsXj);
for n = 1:N
    a = Xi(n);
    b = Xj(n);
    NXY(a,b) = NXY(a,b) + 1;
end
NX = sum(NXY,2);
NY = sum(NXY,1);
NXNYDivN = NX.*NY/N;
nonzeroID = NXY ~= 0;
nonzeroNXY = NXY(nonzeroID);
nonzeroNXNYDivN = NXNYDivN(nonzeroID);
mi = sum((nonzeroNXY./N).*log(nonzeroNXY./nonzeroNXNYDivN),'all');
end
```

This code creates the `iComputeMIXY` helper function. Note that this helper function uses the `iComputeMIXIXJ` helper function.

```matlab
function f = iComputeMIXY(binnedX,binnedY,nbinsX,nbinsY)
p = size(binnedX,2);
f = zeros(p,1);
for i = 1:p
    Xi = binnedX(:,i);
    nbinsXi = nbinsX(i);
    f(i) = iComputeMIXIXJ(Xi,binnedY,nbinsXi,nbinsY);
end
end
```

This code creates the `howmany` helper function.

```matlab
function [n,result] = howmany(alpha,R,J)
Q = qubo((1-alpha)*R - alpha*diag(J));
result = solve(Q);
n = sum(result.BestX);
end
```

This code creates the `iBinPredictors` helper function. Note that this helper function uses the `iDiscretize` helper function.

```matlab
function binnedInfo = iBinPredictors(X,nbins)
[N,p] = size(X);
binnedX = zeros(N,p);
edgesX = cell(1,p);
iscatX = false(1,p);
nbinsX = zeros(1,p);
istbl = istable(X);
for i = 1:p
    if istbl
        oneX = X{:,i};
    else
        oneX = X(:,i);
    end
    nbinsi = min(nbins, numel(unique(oneX)));
    [binnedX(:,i),edgesX{i},iscatX(i),nbinsX(i)] = iDiscretize(oneX,nbinsi);
end
binnedInfo = struct;
binnedInfo.binned = binnedX;
binnedInfo.edges = edgesX;
binnedInfo.iscat = iscatX;
binnedInfo.nbins = nbinsX;
end
```

This code creates the `iDiscretize` helper function.

```matlab
function [binnedx,edges,iscat,nbins] = iDiscretize(x,nbins)
if iscategorical(x)
    binnedx = double(x);
    edges = [];
    iscat = true;
    nbins = numel(unique(x));
else
    probs = [0,(1:(nbins-1))/nbins,1];
    edges = quantile(x,probs);
    binnedx = discretize(x,edges,IncludedEdge="right");
    iscat = false;
end
end
```

## References

[1] Mücke, S., R. Heese, S. Müller,
M. Wolter, and N. Piatkowski. *Quantum Feature Selection.*
arXiv:2203.13261v1, March 2022. Available at https://arxiv.org/pdf/2203.13261.pdf.

## See Also

`fitrtree` (Statistics and Machine Learning Toolbox) | `solve` | `qubo` | `tabuSearch`