Code covered by the BSD License  

Highlights from
Introduction to Statistics Toolbox - July 2007 Webinar Files


17 Jul 2007 (Updated )

Presentation and demo files for the "Introduction to Statistics Toolbox™" Webinar (July 12, 2007)

Identifying Green Automotive Designs


In this demo, we will perform statistical analysis on automotive fuel economy data provided by the United States Environmental Protection Agency. We will see how the Statistics Toolbox™ can help us explore and analyze historical automotive performance data, and create a model that represents the data well. The demo will introduce the MATLAB® environment and showcase some of the newest features of the Statistics Toolbox.

Setup

Turn off the variable-name warning (after saving its current state, so that it can be restored at the end)

warnState = warning('query', 'stats:dataset:setvarnames:ModifiedVarnames');
warning('off', 'stats:dataset:setvarnames:ModifiedVarnames');
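The save/disable/restore pattern used here can be checked in isolation. The following is a minimal sketch using only base MATLAB; the variable names `prev`, `during`, and `after` are illustrative and not part of the demo files.

```matlab
% Sketch: save, disable, and restore the state of a single warning.
id = 'stats:dataset:setvarnames:ModifiedVarnames';

prev = warning('query', id);    % struct with fields 'identifier' and 'state'
warning('off', id);             % disable only this warning
during = warning('query', id);  % state while the warning is disabled

warning(prev);                  % restore the saved state
after = warning('query', id);   % should match the state saved in prev

fprintf('Saved: %s, during: %s, restored: %s\n', ...
  prev.state, during.state, after.state);
```

Restoring from the saved struct, rather than calling `warning('on', ...)`, leaves the warning in whatever state the user had before the demo ran.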

Load Data

Load the fuel economy data from 2000 to 2007 into a dataset array.

data = loadData();
Reading in 1 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 2 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 3 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 4 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 5 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 6 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Reading in 7 of 7
Warning: Variable names were modified to make them valid
MATLAB identifiers. 
Done

Exploration

First, we would like to examine the distribution of the MPG data to get a better understanding of the variability. For this, we will use the Distribution Fitting Tool from the Statistics Toolbox. dfittool simplifies the task of visualizing and fitting distributions.

dfittool(data.mpg, [], [], 'MPG');

The histogram seems to have multiple peaks. Perhaps we need to break the data up into groups, "highway" and "city".

% close DFITTOOL (custom function)
closeDTool;

dfittool(data.mpg(data.C_H == 'H'), [], [], 'Highway');
dfittool(data.mpg(data.C_H == 'C'), [], [], 'City');
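The grouping above relies on logical indexing: a comparison produces a mask, and the mask selects the matching rows. A minimal sketch of the same idea, assuming a small synthetic sample (not the EPA data) and a plain character column in place of the nominal variable `data.C_H`:

```matlab
% Synthetic example values, for illustration only (not the EPA data).
mpg = [31.2 22.5 35.0 19.8 28.4 24.1]';
C_H = ['H'; 'C'; 'H'; 'C'; 'H'; 'C'];   % 'H' = highway, 'C' = city

isHwy = (C_H == 'H');                   % logical mask, one entry per row

hwyMean  = mean(mpg(isHwy));            % mean over highway rows
cityMean = mean(mpg(~isHwy));           % mean over city rows

fprintf('Highway mean: %.2f mpg, city mean: %.2f mpg\n', hwyMean, cityMean);
```

Clearly separated group means are one quick sign that a multi-peaked histogram is a mixture of distinct groups.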

If we look at the "highway" data, the distribution still has multiple peaks. We can group it even further into "cars" and "trucks".

% close DFITTOOL (custom function)
closeDTool;

dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), [], [], 'Highway Car');
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'T'), [], [], 'Highway Truck');

We will try different types of distributions by clicking on "New Fit..." in dfittool. The "Normal" distribution doesn't seem to fit well. The "Logistic" distribution seems to be a better fit. It's not perfect, but it is good enough for our purpose.

myDistFit(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), ...
  data.mpg(data.C_H == 'H' & data.car_truck == 'T'))
xlabel('MPG');
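The visual comparison of the Normal and Logistic fits can also be made numerically. The sketch below is not the custom `myDistFit` function; it assumes a release of the Statistics Toolbox that provides `fitdist` (added after this webinar) and uses a synthetic sample rather than the EPA data.

```matlab
% Synthetic sample drawn from a logistic distribution (not the EPA data).
rng(0);                             % fix the seed for repeatability
u = rand(2000, 1);
x = 35 + 3 * log(u ./ (1 - u));     % inverse-CDF sampling: location 35, scale 3

% Fit both candidate distributions by maximum likelihood.
pdNormal   = fitdist(x, 'Normal');
pdLogistic = fitdist(x, 'Logistic');

% Both distributions have two parameters, so the negative log-likelihoods
% are directly comparable: a lower value indicates a closer fit.
nllNormal   = negloglik(pdNormal);
nllLogistic = negloglik(pdLogistic);

fprintf('Normal NLL: %.1f, Logistic NLL: %.1f\n', nllNormal, nllLogistic);
```

On the real data, the same two calls could be applied to `data.mpg(data.C_H == 'H' & data.car_truck == 'C')` to back up what the plot suggests.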

Matrix Plot Visualization

We now know the overall variability in MPG, but what may be causing this? Which factors affect MPG and how? To answer these questions, let's look at the relationship between MPG and some of the other variables.

We will first look at a few of these variables: the engine displacement, the horsepower, and weight. The command gplotmatrix creates a matrix plot grouped by categories. We will categorize by City/Highway and Car/Truck.

carType = {'C', 'MPG_{car}';'T', 'MPG_{truck}'};
figPos  = [.2 .55 .6 .45;.2 .1 .6 .45];
predNames  = {'Displacement', 'Horsepower', 'Weight'};
for id = 1:2
  dataID = data.car_truck == carType{id, 1};
  response   = data.mpg(dataID);
  respNames  = carType(id, 2);
  predictors = [data.cid(dataID), data.rhp(dataID), data.etw(dataID)];

  figure('units', 'normalized', 'outerposition', figPos(id, :));
  [h, ax] = gplotmatrix(predictors, response, data.C_H(dataID), ...
    [], [], 5, [], [], predNames, respNames);

  title('Emissions Test Results');
end

All 3 variables appear to be negatively correlated with MPG. As expected, highway driving yields better fuel economy than city driving.
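The sign of the relationship seen in the matrix plot can be confirmed with a correlation coefficient. A minimal sketch with base MATLAB's `corrcoef`, using hypothetical displacement and MPG values (not taken from the EPA data):

```matlab
% Hypothetical values for illustration: displacement (cubic inches) and MPG.
displ = [98 122 146 183 214 281 348]';
mpg   = [40  37  34  31  29  25  22]';

R = corrcoef(displ, mpg);   % 2-by-2 correlation matrix
r = R(1, 2);                % off-diagonal entry: correlation of displ and mpg

fprintf('Correlation between displacement and MPG: %.3f\n', r);
```

A value near -1 indicates a strong negative linear relationship. On the real data, the same call could be repeated for `data.rhp` and `data.etw` within each group.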

Modeling Fuel Economy

We will now focus on one of the four groups, Highway-Cars, and create a model of MPG based on a few predictors.

We will look at the Highway Cars and test 6 potential predictors: engine displacement, rated horsepower, estimated test weight, compression ratio, axle ratio, and engine/vehicle speed ratio. We will use ANOVA to assess the significance of each one.

The first model we will check is a linear model:

$$mpg = a_0 + a_1x_1 + a_2x_2 + ... + a_6x_6$$
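To make the form of this model concrete, the coefficients can be estimated by ordinary least squares. The sketch below uses base MATLAB only, with two predictors and synthetic data generated from known coefficients; it is not the `anovan` analysis used in the demo, and the names `a_true` and `a_hat` are illustrative.

```matlab
% Synthetic predictors and a response built from known coefficients,
% so the recovered coefficients can be checked against the truth.
x1 = [1 2 3 4 5 6]';
x2 = [2 1 4 3 6 5]';
a_true = [50; -2; -1];                      % [a_0; a_1; a_2]
y = a_true(1) + a_true(2)*x1 + a_true(3)*x2;

% Design matrix: a column of ones for a_0, then one column per predictor.
X = [ones(size(x1)) x1 x2];

% Least-squares solution of X*a = y.
a_hat = X \ y;

disp(a_hat');
```

With noise-free data the estimates match `a_true` to rounding error. With real data they are the coefficients that minimize the sum of squared residuals.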

dat = data(data.C_H == 'H' & data.car_truck == 'C', :);

% List of potential predictor variables
predictors = {dat.cid, dat.rhp, dat.etw, dat.cmp, dat.axle, dat.n_v};
varNames   = {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'};
p = anovan(dat.mpg, predictors, 'continuous', 1:length(predictors), ...
  'varnames', varNames);

The ANOVA table shows the sources of variability in the model. The p-value (last column) indicates whether a given term has a significant effect. We will remove any terms that are not significant at the 5% level. In this case, the axle ratio seems to be insignificant. We will rerun ANOVA with up to 3-way interaction terms included in the model. The new model is

$$mpg = a_0 + a_1x_1 + ... + a_5x_5 + a_6x_1x_2 + ... + a_{16}x_1x_2x_3 + ... + a_{25}x_3x_4x_5$$
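The indices in this model follow from counting combinations: 5 main effects, 10 two-way products, and 10 three-way products give 25 terms. The sketch below counts them and builds a terms matrix in the row-per-term, column-per-predictor layout, where 1 marks a predictor included in the term. The ordering here is chosen to match the equation; the ordering that `anovan` returns in `terms` may differ.

```matlab
nPred = 5;                              % predictors left after dropping axle ratio

% Number of terms of order 1, 2, and 3.
counts = [nchoosek(nPred, 1), nchoosek(nPred, 2), nchoosek(nPred, 3)];
nTerms = sum(counts);                   % 5 + 10 + 10

% One row per term, one column per predictor.
termsSketch = zeros(nTerms, nPred);
row = 0;
for order = 1:3
  combos = nchoosek(1:nPred, order);    % all predictor subsets of this size
  for k = 1:size(combos, 1)
    row = row + 1;
    termsSketch(row, combos(k, :)) = 1; % mark the predictors in this term
  end
end

fprintf('Total terms (excluding the constant): %d\n', nTerms);
disp(termsSketch([6 16 25], :));        % x1*x2, x1*x2*x3, x3*x4*x5
```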

% Drop predictors whose main effect is not significant at the 5% level
predictors(p > 0.05) = [];
varNames(p > 0.05)   = [];

[p2, t, stats, terms] = anovan(dat.mpg, predictors, ...
  'continuous', 1:length(predictors), 'varnames', varNames, 'model', 3);

Finally, we will perform a regression with only the significant terms.

terms(p2 > 0.05, :) = [];

s = regstats(dat.mpg, cell2mat(predictors), ...
  [zeros(1, length(predictors)); terms], {'beta', 'yhat', 'r', 'rsquare'});

home;
r_square = s.rsquare
r_square =
    0.8478

The R-squared value is the fraction of the variation in MPG that is explained by the model; here the model explains about 85% of it. Let's visually inspect the goodness of the regression.
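The R-squared definition can be written out directly as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical observed and fitted values (not output from `regstats`):

```matlab
% Hypothetical observed and fitted values, for illustration only.
y    = [30 32 35 38 41]';
yhat = [31 32 34 39 40]';

sse = sum((y - yhat).^2);       % residual sum of squares
sst = sum((y - mean(y)).^2);    % total sum of squares about the mean

r2 = 1 - sse / sst;             % fraction of variation explained

fprintf('R-squared: %.4f\n', r2);
```

For a least-squares fit that includes a constant term, this quantity lies between 0 and 1 and corresponds to the `rsquare` field requested from `regstats`.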

myPlot(dat.mpg, s.yhat, s.rsquare, 'Highway-Car');

Create model for other combinations

We will use the above method to create a model for the other three combinations: Highway-Truck, City-Car, and City-Truck. The function modelMPG simply packages the steps from the cells above for creating a model of MPG.

Highway Truck

modelMPG(data, 'Highway', 'Truck');
r_square =
    0.7390

City Car

modelMPG(data, 'City', 'Car');
r_square =
    0.8735

City Truck

modelMPG(data, 'City', 'Truck');
r_square =
    0.7936

Simulation

Now that we have a model, the final step is to simulate other scenarios. Let's look at the 2007 HONDA ACCORD.

home;
idx = (dat.yr == '2007' & dat.mfrName == 'HONDA' & dat.carline == 'ACCORD');
hondaAccord = dat(idx, {'yr', 'mfrName', 'carline', 'mpg'})
hondaAccord = 
    yr      mfrName    carline    mpg 
    2007    HONDA      ACCORD     43.6
    2007    HONDA      ACCORD     43.1
    2007    HONDA      ACCORD     43.3
    2007    HONDA      ACCORD     36.6
    2007    HONDA      ACCORD     38.5
    2007    HONDA      ACCORD     35.7
    2007    HONDA      ACCORD     38.1
    2007    HONDA      ACCORD     38.2
    2007    HONDA      ACCORD     37.4

Let's compare this to what the model gives us. We will call simMPG to simulate with the appropriate input arguments.

vars = dat(idx, {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'});
% Keep exactly the predictors that were not removed by (p > 0.05) above
hondaAccord_model = simMPG(vars(:, ~(p > 0.05)), terms, s);

hondaMPG = [hondaAccord.mpg'; hondaAccord_model'];
fprintf('\n\nModel validation (mpg):\n\n');
fprintf('Actual    Model     Diff\n');
fprintf('%6.2f   %6.2f   %6.2f\n', [hondaMPG; diff(hondaMPG)]);

Model validation (mpg):

Actual    Model     Diff
 43.60    38.90    -4.70
 43.10    38.90    -4.20
 43.30    37.18    -6.12
 36.60    35.06    -1.54
 38.50    35.62    -2.88
 35.70    35.06    -0.64
 38.10    35.62    -2.48
 38.20    34.63    -3.57
 37.40    34.63    -2.77
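The validation table can be summarized with two numbers: the mean error (bias) and the root-mean-square error. The sketch below retypes the values printed above; the variable names are illustrative.

```matlab
% Values copied from the validation table above (mpg).
actual = [43.6  43.1  43.3  36.6  38.5  35.7  38.1  38.2  37.4];
model  = [38.90 38.90 37.18 35.06 35.62 35.06 35.62 34.63 34.63];

err = model - actual;              % same sign convention as the Diff column

meanErr = mean(err);               % average bias of the model
rmse    = sqrt(mean(err.^2));      % typical size of the error

fprintf('Mean error: %.2f mpg, RMSE: %.2f mpg\n', meanErr, rmse);
```

A negative mean error means that, for these nine configurations, the model predicts lower fuel economy than was measured.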

The model gives broadly similar values, although it underpredicts every configuration, by 0.6 to 6.1 mpg. Now, we will simulate the fuel economy for a design where the engine displacement is decreased by 20%.

% Decrease displacement by 20%
vars.cid = vars.cid * 0.8;

% Keep exactly the predictors that were not removed by (p > 0.05) above
hondaAccord_model2 = simMPG(vars(:, ~(p > 0.05)), terms, s);

hondaMPG2 = [hondaAccord_model'; hondaAccord_model2'];
fprintf('\n\nModel data (mpg):\n\n');
fprintf('Current   Smaller Disp   Diff  %%Increase\n');
fprintf('%6.2f      %6.2f     %6.2f   %6.2f\n', ...
  [hondaMPG2; diff(hondaMPG2); diff(hondaMPG2)./hondaMPG2(1,:)*100]);

Model data (mpg):

Current   Smaller Disp   Diff  %Increase
 38.90       40.39       1.49     3.83
 38.90       40.39       1.49     3.83
 37.18       38.69       1.50     4.05
 35.06       36.27       1.22     3.47
 35.62       36.97       1.35     3.79
 35.06       36.27       1.22     3.47
 35.62       36.97       1.35     3.79
 34.63       36.00       1.37     3.95
 34.63       36.00       1.37     3.95

Compared to the current configuration, the design with the smaller displacement would improve predicted fuel economy slightly, by about 1.2 to 1.5 mpg (roughly 3.5% to 4%).
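The Diff and %Increase columns come from a row-wise `diff` followed by a division by the baseline row. The sketch below repeats that arithmetic on three rows retyped from the table. Because it starts from the rounded printed values, results may differ from the table in the last digit.

```matlab
% Three rows copied from the table above (mpg, rounded as printed).
current = [38.90 37.18 35.06];
smaller = [40.39 38.69 36.27];

both  = [current; smaller];
delta = diff(both);                  % row 2 minus row 1: gain in mpg
pct   = delta ./ both(1, :) * 100;   % gain relative to the current design

fprintf('%6.2f   %6.2f   %6.2f   %6.2f\n', [both; delta; pct]);
```

`fprintf` consumes its matrix argument column by column, so each printed line holds the current value, the new value, the difference, and the percent increase for one configuration.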

Conclusion

We can now use this model for simulating different scenarios to come up with recommendations for a new automobile design.

warning(warnState);
