On 3/22/2010 5:30 AM, Shirley Zheng wrote:
> Sorry, I am very new to Matlab and I know that its a very simple
> question but I don't really understand: For the regress() command, it
> says 'If the columns of X are linearly dependent, regress obtains a
> basic solution by setting the maximum number of elements of b to zero'.
> Can anybody explain what is '..setting the maximum number of elements of
> b to zero'?
When X has columns that are linearly dependent, the least squares problem has no unique solution; in fact, it has infinitely many (that follows from basic linear algebra). REGRESS is based on MATLAB's backslash operator "\", and out of those infinitely many solutions, backslash returns the "basic solution", i.e., one that has as many zero coefficients as possible. By setting some of the coefficients to zero, backslash in effect ignores the corresponding columns of X, and it can do that because they provide no additional information beyond that given by the other columns: the ignored columns can be constructed as linear combinations of the others. If X has m columns and only q of them are linearly independent, then m-q coefficients in b can be set to zero.
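The count m-q can be checked directly with RANK. The following is only a small sketch; the matrix and the variable names (m, q, numZeros) are made up for illustration and are not part of REGRESS itself:

```matlab
% Sketch: count how many coefficients a basic solution can set to zero.
n = 7;
X = [ones(n,1) randn(n,2)];   % 3 columns, independent with probability 1
X = [X X(:,1)+X(:,3)];        % 4th column is the sum of columns 1 and 3
m = size(X,2);                % total number of columns
q = rank(X);                  % number of linearly independent columns
numZeros = m - q              % coefficients that can be set to zero
```

Here one column is redundant, so a basic solution can contain one zero coefficient.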
For example, construct a full column rank X, and a random y, and regress y on X:
>> n = 7;
>> X1 = [ones(n,1) randn(n,2)];
>> y = randn(n,1);
>> b1 = regress(y,X1)
b1 =
-1.1733
-0.21145
-0.65243
Now add two columns to X that are linearly dependent on the existing columns, and regress y on that:
>> X2 = [X1 X1(:,1)+X1(:,3) X1(:,2)+X1(:,3)];
>> b2 = regress(y,X2)
Warning: X is rank deficient to within machine precision.
> In regress at 82
b2 =
-1.1733
-0.21145
-0.65243
0
0
Notice that REGRESS has returned exact zeros as the coefficients associated with those two new columns, i.e., it has ignored those two columns. Thus, the remaining coefficients are the same as in the first regression. That usually does _not_ happen. This is more typical:
>> X1 = [ones(n,1) randn(n,2)];
>> b1 = regress(y,X1)
b1 =
-0.47449
0.98807
-0.1507
>> X2 = [X1 X1(:,1)+X1(:,3) X1(:,2)+X1(:,3)];
>> b2 = regress(y,X2)
Warning: X is rank deficient to within machine precision.
> In regress at 82
b2 =
0.66429
0
0
-1.1388
0.98807
This time, REGRESS has ignored columns 2 and 3. The choice of where the zeros go is not based on anything statistically meaningful; it is simply a choice that backslash makes based on numerical concerns.
In either case, if you compute X1*b1 and X2*b2, you'll find that the fitted values are the same (up to roundoff).
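A quick way to check that claim is to rebuild the two design matrices and compare the fitted values. This is a sketch along the lines of the session above; maxDiff is just an illustrative name, and REGRESS requires the Statistics Toolbox:

```matlab
% Sketch: the full-rank and rank-deficient fits give the same fitted values.
n  = 7;
X1 = [ones(n,1) randn(n,2)];                 % full column rank
y  = randn(n,1);
X2 = [X1 X1(:,1)+X1(:,3) X1(:,2)+X1(:,3)];   % two redundant columns added
b1 = regress(y,X1);
b2 = regress(y,X2);                          % issues the rank-deficiency warning
maxDiff = max(abs(X1*b1 - X2*b2))            % largest difference in fitted values
```

The difference should be on the order of roundoff, even though b1 and b2 look nothing alike.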
There are other possible ways to deal with collinearity. One is to use PINV to get what's known as the "minimum norm solution". Another is to choose which columns of X to ignore based on more statistical criteria, as STEPWISEFIT does, for example.
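To illustrate the PINV route, here is a rough sketch comparing it with the basic solution from backslash. The names bBasic, bMinNorm, and fitDiff are made up for this example, and backslash is called directly here rather than through REGRESS:

```matlab
% Sketch: basic solution (backslash) vs. minimum norm solution (PINV).
n  = 7;
X1 = [ones(n,1) randn(n,2)];
y  = randn(n,1);
X2 = [X1 X1(:,1)+X1(:,3) X1(:,2)+X1(:,3)];   % rank 3, but 5 columns
bBasic   = X2 \ y;                           % warns that X2 is rank deficient
bMinNorm = pinv(X2) * y;                     % generally has no exact zeros
[norm(bBasic) norm(bMinNorm)]                % second norm is never the larger
fitDiff = max(abs(X2*bBasic - X2*bMinNorm))  % fitted values agree to roundoff
```

Both coefficient vectors solve the same least squares problem; PINV just picks the one with the smallest norm instead of the one with the most zeros.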
Hope this helps.