Statistics Toolbox includes a lot of impressive new functionality for regression analysis.
I'm attaching code for a blog post that will come out in a couple of weeks.
X = linspace(1,100,50);
X = X';
Y = 7*X + 50 + 30*randn(50,1);
New_X = 100 * rand(10,1);
myFit = LinearModel.fit(X,Y)
The first line shows the linear regression model. When you perform a regression, you need to specify a model that describes the relationship between your variables. By default, LinearModel assumes that you want to model the relationship as a straight line with an intercept term. The expression "y ~ 1 + x1" describes this model. Formally, this expression translates as "Y is modeled as a linear function which includes an intercept and a variable". Once again, note that we are representing a model of the form Y = mX + B...
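To make the "y ~ 1 + x1" formula a bit more concrete, here is a minimal sketch in base MATLAB that fits the same straight-line model by ordinary least squares with the backslash operator. The names A and coeffs are my own, and this only reproduces the coefficient estimates, not the rest of the LinearModel display.

```matlab
% Rebuild the example data (same construction as above).
X = linspace(1,100,50)';
Y = 7*X + 50 + 30*randn(50,1);

% "1 + x1" means: a column of ones (the intercept) plus the predictor itself.
A = [ones(size(X)), X];

% Ordinary least squares via backslash.
% coeffs(1) is the intercept (B), coeffs(2) is the slope (m).
coeffs = A \ Y;
```

Because the data is noisy, the estimates should land close to, but not exactly on, the intercept of 50 and slope of 7 used to generate Y.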
The next block of text includes estimates for the coefficients, along with basic information regarding the reliability of those estimates.
Finally, we have basic information about the goodness-of-fit including the R-square, the adjusted R-square and the Root Mean Squared Error.
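If you want to see where those goodness-of-fit numbers come from, the sketch below computes them by hand from the residuals using the standard textbook definitions. It uses only base MATLAB; the variable names (SSE, SST, R2, adjR2, RMSE) are mine rather than anything LinearModel exposes under those names.

```matlab
% Same straight-line example, fitted with plain least squares.
X = linspace(1,100,50)';
Y = 7*X + 50 + 30*randn(50,1);
A = [ones(size(X)), X];
coeffs = A \ Y;

n = numel(Y);            % number of observations
p = size(A,2);           % number of estimated coefficients (incl. intercept)
resid = Y - A*coeffs;    % residuals

SSE = sum(resid.^2);             % error sum of squares
SST = sum((Y - mean(Y)).^2);     % total sum of squares

R2    = 1 - SSE/SST;                        % R-square
adjR2 = 1 - (SSE/(n-p)) / (SST/(n-1));      % adjusted R-square
RMSE  = sqrt(SSE/(n-p));                    % root mean squared error
```

Since the noise in this example has a standard deviation of 30, the RMSE should come out somewhere in that neighborhood.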
Notice that this simple command creates a plot with a wealth of information including
- A scatter plot of the original dataset
- A line showing our fit
- Confidence intervals for the fit
MATLAB has also automatically labelled our axes and added a legend.
My plot looks like random noise - which in this case is a very good thing.
Here, once again, the lack of any noticeable pattern in the residuals suggests a good fit. If the residuals formed a line or a cigar-shaped pattern, this would suggest autocorrelation.
Cook's Distance is a metric that is commonly used to see whether a dataset contains any outliers. For any given data point, Cook's Distance is calculated by performing a brand new regression that excludes that data point. Cook's Distance measures how much the shape of the curve changes between the two fits. If the curve moves by a large amount, that data point has a great deal of influence on the model and might very well be an outlier.
- The red crosses show the Cook's Distance for each point in the data set.
- The horizontal line shows "Three times the average Cook's Distance for all the points in the data set". Data points whose Cook's Distance is greater than three times the mean are often considered possible outliers.
In this example, none of our data points look as if they are outliers.
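The leave-one-out idea behind Cook's Distance can be sketched directly in base MATLAB. This is an illustration of the definition under my own variable names (cooksD, threshold, flagged), not the code LinearModel uses internally: it refits the line once per data point and measures how far the fitted values move.

```matlab
% Same straight-line example as before.
X = linspace(1,100,50)';
Y = 7*X + 50 + 30*randn(50,1);
A = [ones(size(X)), X];
[n, p] = size(A);

% Fit using every point, and estimate the error variance.
coeffs = A \ Y;
fitted = A*coeffs;
MSE = sum((Y - fitted).^2) / (n - p);

% Refit n times, each time leaving one observation out, and measure
% how far the fitted values shift.
cooksD = zeros(n,1);
for i = 1:n
    keep = true(n,1);
    keep(i) = false;
    coeffs_i = A(keep,:) \ Y(keep);     % regression without point i
    shift = fitted - A*coeffs_i;        % change in all n fitted values
    cooksD(i) = sum(shift.^2) / (p*MSE);
end

% Rule of thumb from the text: flag points above three times the mean.
threshold = 3*mean(cooksD);
flagged = find(cooksD > threshold);
```

Because the data is random, flagged may or may not be empty on any given run. In practice you would not loop like this, since the same values can be obtained from the residuals and leverages of a single fit.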
Predictions = predict(myFit, New_X)
X1 = 100 * randn(100,1);
X2 = 100 * rand(100,1);
X = [X1, X2];
Y = 3*X1.^2 + 5*X1.*X2 + 7* X2.^2 + 9*X1 + 11*X2 + 30 + 100*randn(100,1);
myFit2 = LinearModel.fit(X,Y)
Let's take a look at the output from this example. We can see, almost immediately, that something has gone wrong with our fit.
- The R^2 value is pretty bad
- The regression coefficients are nowhere near the ones we specified when we created the dataset
If we look at the line that describes the linear regression model we can see what went wrong. By default, LinearModel is fitting a plane to the dataset. (In our initial example, we had a single predictor, so LinearModel defaulted to a line. Here we have two predictors, so LinearModel is defaulting to a plane.) However, we "know" that the true relationship between X and Y should be modelled with a high order polynomial. We need to pass this additional piece of information to "LinearModel".
Modeling a high order polynomial (Option 1)
Here are a couple of different ways that I can use LinearModel to model a high order polynomial. The first option is to write out the formula by hand.
myFit2 = LinearModel.fit(X,Y, 'y ~ 1 + x1^2 + x2^2 + x1:x2 + x1 + x2')
myFit2 = LinearModel.fit(X, Y, 'poly22')
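Both calls ask for the same set of terms: a constant, x1, x2, x1^2, x1*x2 and x2^2. As a rough cross-check, the sketch below builds that design matrix by hand in base MATLAB and solves the least-squares problem with backslash. The column ordering and the names A and coeffs are my own choices, not something LinearModel guarantees.

```matlab
% Rebuild the two-predictor example.
X1 = 100 * randn(100,1);
X2 = 100 * rand(100,1);
Y = 3*X1.^2 + 5*X1.*X2 + 7*X2.^2 + 9*X1 + 11*X2 + 30 + 100*randn(100,1);

% One column per term in the formula:
%   1, x1, x2, x1^2, x1*x2, x2^2
A = [ones(size(X1)), X1, X2, X1.^2, X1.*X2, X2.^2];

% Least-squares estimates, in the same order as the columns of A.
coeffs = A \ Y;
```

With all six terms in the model, the quadratic and interaction coefficients should come out close to the 3, 5 and 7 used to generate the data; the intercept and linear terms are typically estimated less precisely.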
X = linspace(0, 6*pi, 90);
X = X';
Y = 10 + 3*(sin(1*X + 5)) + .2*randn(90,1);
myFit3 = NonLinearModel.fit(X,Y, 'y ~ b0 + b1*sin(b2*x + b3)', [11, 2.5, 1.1, 5.5])
plot(X, myFit3.Fitted, 'r')
myFit4 = NonLinearModel.fit(X,Y, @(b,x)(b(1) + b(2)*sin(b(3)*x + b(4))), [11, 2.5, 1.1, 5.5])
ds = dataset(MPG,Weight);
ds.Year = ordinal(Model_Year);
mdl = LinearModel.fit(ds,'MPG ~ Year + Weight^2')