# Identifying Green Automotive Designs

In this demo, we will perform statistical analysis on automotive fuel economy data provided by the United States Environmental Protection Agency. We will see how the Statistics Toolbox™ can help us explore and analyze historical automotive performance data, and create a model that represents the data well. The demo will introduce the MATLAB® environment and showcase some of the newest features of the Statistics Toolbox.

## Setup

Turn off the variable-name warning, after saving its current state so that it can be restored later:

```
warnState = warning('query', 'stats:dataset:setvarnames:ModifiedVarnames');
warning('off', 'stats:dataset:setvarnames:ModifiedVarnames');
```
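The save-and-restore pattern can also be tried in isolation. The following is a minimal sketch using only base MATLAB calls; the identifier `demo:example:hypotheticalWarning` is a made-up name for illustration, not a real Statistics Toolbox warning ID.

```matlab
% Hypothetical warning identifier, used only for illustration.
warnId = 'demo:example:hypotheticalWarning';

% Save the current state; the result is a struct with the fields
% 'identifier' and 'state'.
savedState = warning('query', warnId);

% Suppress the warning while the noisy code runs.
warning('off', warnId);
warning(warnId, 'This message is suppressed.');   % prints nothing

% Restore whatever state was in effect before.
warning(savedState);
restoredState = warning('query', warnId);
```

Passing the saved struct back to `warning` restores the earlier state whether it was originally `'on'` or `'off'`, which is why the state is queried first rather than simply re-enabled at the end.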

## Load Data

Load fuel economy data from year 2000 to 2007 into a dataset array:

```
data = loadData();
```

```
Reading in 1 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 2 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 3 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 4 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 5 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 6 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 7 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Done
```

## Exploration

First, we would like to examine the distribution of the MPG data to get a better understanding of the variability. For this, we will use the Distribution Fitting Tool from the Statistics Toolbox. `dfittool` simplifies the task of visualizing and fitting distributions.

```
dfittool(data.mpg, [], [], 'MPG');
```

The histogram seems to have multiple peaks. Perhaps we need to break the data up into groups, "highway" and "city".

```
% close DFITTOOL (custom function)
closeDTool;
dfittool(data.mpg(data.C_H == 'H'), [], [], 'Highway');
dfittool(data.mpg(data.C_H == 'C'), [], [], 'City');
```

If we look at the "highway" data, the distribution still has multiple peaks. We can group it even further into "cars" and "trucks".

```
% close DFITTOOL (custom function)
closeDTool;
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), [], [], 'Highway Car');
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'T'), [], [], 'Highway Truck');
```

We will try different types of distributions. We can do this by clicking on "New Fit..." in `dfittool`. The "Normal" distribution does not seem to fit well; "Logistic" seems to be a better fit. It is not perfect, but it is good enough for our purpose.

```
myDistFit(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), ...
    data.mpg(data.C_H == 'H' & data.car_truck == 'T'))
xlabel('MPG');
```

## Matrix Plot Visualization

We now know the overall variability in MPG, but what may be causing it? Which factors affect MPG, and how? To answer these questions, let's look at the relationship between MPG and a few of the other variables.

We will first look at a few of these variables: the engine displacement, the horsepower, and weight. The command `gplotmatrix` creates a matrix plot grouped by categories. We will categorize by City/Highway and Car/Truck.

```
carType = {'C', 'MPG_{car}'; 'T', 'MPG_{truck}'};
figPos = [.2 .55 .6 .45; .2 .1 .6 .45];
predNames = {'Displacement', 'Horsepower', 'Weight'};
for id = 1:2
    dataID = data.car_truck == carType{id, 1};
    response = data.mpg(dataID);
    respNames = carType(id, 2);
    predictors = [data.cid(dataID), data.rhp(dataID), data.etw(dataID)];
    figure('units', 'normalized', 'outerposition', figPos(id, :));
    [h, ax] = gplotmatrix(predictors, response, data.C_H(dataID), ...
        [], [], 5, [], [], predNames, respNames);
    title('Emissions Test Results');
end
```

All three variables appear to be negatively correlated with MPG. As expected, highway driving seems to give better fuel economy than city driving.
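One way to put a number on this visual impression is the correlation coefficient. The sketch below uses base MATLAB's `corrcoef` on a small made-up sample; the values are illustrative assumptions, not EPA data. With the demo's variables, one would pass, for example, `data.etw(dataID)` and `data.mpg(dataID)` instead.

```matlab
% Made-up illustrative values, NOT taken from the EPA data set.
weight = [2500; 3000; 3500; 4000; 4500];   % hypothetical test weights
mpg    = [  42;   38;   33;   29;   26];   % hypothetical fuel economy

% corrcoef returns a 2-by-2 matrix; the off-diagonal entry is the
% correlation between the two variables.
R = corrcoef(weight, mpg);
r = R(1, 2);

fprintf('Correlation between weight and MPG: %.3f\n', r);
```

A value of `r` close to -1 indicates a strong negative linear relationship, which is what a downward-sloping cloud in the matrix plot suggests.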

## Modeling Fuel Economy

We will now focus on one of the four groups, Highway-Cars, and create a model of MPG based on a few predictors.

We will look at the Highway Cars and test six potential predictors: engine displacement, rated horsepower, estimated test weight, compression ratio, axle ratio, and engine/vehicle speed ratio. We will use ANOVA to analyze them.

The first model we will check is a linear model:

```
dat = data(data.C_H == 'H' & data.car_truck == 'C', :);

% List of potential predictor variables
predictors = {dat.cid, dat.rhp, dat.etw, dat.cmp, dat.axle, dat.n_v};
varNames = {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'};

p = anovan(dat.mpg, predictors, 'continuous', 1:length(predictors), ...
    'varnames', varNames);
```

The ANOVA table shows the sources of variability in the model. The p-value (last column) indicates whether a given term has a significant effect in the model. We will remove any terms that are not significant at the 0.05 level. In this case, the axle ratio seems to be insignificant. We will rerun ANOVA with up to 3-way interaction terms included in the model. The new model is:

```
predictors(p > 0.05) = '';
varNames(p > 0.05) = '';
[p2, t, stats, terms] = anovan(dat.mpg, predictors, ...
    'continuous', 1:length(predictors), 'varnames', varNames, 'model', 3);
```

Finally, we will perform a regression with only the significant terms:

```
terms(p2 > 0.05, :) = [];
s = regstats(dat.mpg, cell2mat(predictors), ...
    [zeros(1, length(predictors)); terms], {'beta', 'yhat', 'r', 'rsquare'});
home;
r_square = s.rsquare
```

```
r_square = 0.8478
```
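The third argument to `regstats` above is a terms matrix. As we understand the convention, each row describes one model term and each column one predictor, with the entry giving the power to which that predictor is raised. A row of zeros is then the constant term, which is why it is prepended to `terms`. The sketch below builds the corresponding design matrix by hand in base MATLAB for a small made-up example. The data and the chosen terms are illustrative assumptions, and the loop mirrors the convention rather than reproducing the toolbox's implementation.

```matlab
% Made-up predictor values: 4 observations, 2 predictors.
X = [1 2;
     3 4;
     5 6;
     7 8];

% Terms matrix: one row per term, one column per predictor.
terms = [0 0;    % constant
         1 0;    % x1
         0 1;    % x2
         1 1];   % x1*x2 interaction

% Build the design matrix: column k is the product over predictors j
% of X(:, j) .^ terms(k, j).
nObs   = size(X, 1);
nTerms = size(terms, 1);
D = ones(nObs, nTerms);
for k = 1:nTerms
    for j = 1:size(X, 2)
        D(:, k) = D(:, k) .* X(:, j) .^ terms(k, j);
    end
end

disp(D);
```

Deleting a row of the terms matrix, as the demo does for the terms with `p2 > 0.05`, simply drops the matching column from the design matrix.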

The R-squared value is the fraction of the variation in MPG that the model explains, here about 85%. Let's visually inspect the goodness of fit of the regression.

```
myPlot(dat.mpg, s.yhat, s.rsquare, 'Highway-Car');
```
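For reference, R-squared can be computed directly from observed and fitted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for a small made-up example; the numbers are illustrative assumptions and do not come from an actual regression. With the demo's variables, `y` would be `dat.mpg` and `yhat` would be `s.yhat`.

```matlab
% Made-up observed and fitted values, for illustration only.
y    = [30; 35; 40; 45];
yhat = [31; 34; 41; 44];

ssRes = sum((y - yhat) .^ 2);        % residual sum of squares
ssTot = sum((y - mean(y)) .^ 2);     % total sum of squares
rSquare = 1 - ssRes / ssTot;

fprintf('R-squared: %.4f\n', rSquare);
```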

## Create model for other combinations

We will use the above method to create a model for the other three combinations: Highway-Truck, City-Car, City-Truck. The following function `modelMPG` is simply a compilation of the few cells above for creating a model for MPG.

**Highway Truck**

```
modelMPG(data, 'Highway', 'Truck');
```

```
r_square = 0.7390
```

**City Car**

```
modelMPG(data, 'City', 'Car');
```

```
r_square = 0.8735
```

**City Truck**

```
modelMPG(data, 'City', 'Truck');
```

```
r_square = 0.7936
```

## Simulation

Now that we have a model, the final step is to simulate other scenarios. Let's look at the 2007 HONDA ACCORD.

```
home;
idx = (dat.yr == '2007' & dat.mfrName == 'HONDA' & dat.carline == 'ACCORD');
hondaAccord = dat(idx, {'yr', 'mfrName', 'carline', 'mpg'})
```

```
hondaAccord =

    yr      mfrName    carline    mpg
    2007    HONDA      ACCORD     43.6
    2007    HONDA      ACCORD     43.1
    2007    HONDA      ACCORD     43.3
    2007    HONDA      ACCORD     36.6
    2007    HONDA      ACCORD     38.5
    2007    HONDA      ACCORD     35.7
    2007    HONDA      ACCORD     38.1
    2007    HONDA      ACCORD     38.2
    2007    HONDA      ACCORD     37.4
```

Let's compare this to what the model gives us. We will call `simMPG` to simulate with the appropriate input arguments.

```
vars = dat(idx, {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'});
hondaAccord_model = simMPG(vars(:, p < 0.05), terms, s);
hondaMPG = [hondaAccord.mpg'; hondaAccord_model'];
fprintf('\n\nModel validation (mpg):\n\n');
fprintf('Actual Model Diff\n');
fprintf('%6.2f %6.2f %6.2f\n', [hondaMPG; diff(hondaMPG)]);
```

```
Model validation (mpg):

Actual Model Diff
 43.60  38.90  -4.70
 43.10  38.90  -4.20
 43.30  37.18  -6.12
 36.60  35.06  -1.54
 38.50  35.62  -2.88
 35.70  35.06  -0.64
 38.10  35.62  -2.48
 38.20  34.63  -3.57
 37.40  34.63  -2.77
```

The model gives broadly similar values, although it underestimates the measured MPG in every case here (by 0.64 to 6.12 mpg). Now, we will simulate the fuel economy for a design where the engine displacement is decreased by 20%.

```
% Decrease displacement by 20%
vars.cid = vars.cid * 0.8;
hondaAccord_model2 = simMPG(vars(:, p < 0.05), terms, s);
hondaMPG2 = [hondaAccord_model'; hondaAccord_model2'];
fprintf('\n\nModel data (mpg):\n\n');
fprintf('Current Smaller Disp Diff %%Increase\n');
fprintf('%6.2f %6.2f %6.2f %6.2f\n', ...
    [hondaMPG2; diff(hondaMPG2); diff(hondaMPG2)./hondaMPG2(1,:)*100]);
```

```
Model data (mpg):

Current Smaller Disp Diff %Increase
 38.90  40.39   1.49   3.83
 38.90  40.39   1.49   3.83
 37.18  38.69   1.50   4.05
 35.06  36.27   1.22   3.47
 35.62  36.97   1.35   3.79
 35.06  36.27   1.22   3.47
 35.62  36.97   1.35   3.79
 34.63  36.00   1.37   3.95
 34.63  36.00   1.37   3.95
```

Compared to the current configuration, the design with the smaller displacement is predicted to have slightly better fuel economy: a gain of about 1.2 to 1.5 mpg, or roughly 3.5% to 4%.
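The "Diff" and "%Increase" columns follow from simple arithmetic on the two model predictions. As a small worked sketch, the code below reproduces two rows of the table in base MATLAB, using the rounded values printed above (the variable names `mpgPair`, `gain`, and `pctGain` are ours, chosen for illustration). Applied to a two-row matrix, `diff` subtracts the first row from the second, column by column.

```matlab
% Two Accord configurations taken from the printed table:
% row 1 = current model prediction, row 2 = smaller-displacement prediction.
mpgPair = [38.90 35.62;
           40.39 36.97];

gain    = diff(mpgPair);                 % row 2 minus row 1, per column
pctGain = gain ./ mpgPair(1, :) * 100;   % relative to the current design

% fprintf walks the stacked matrix column by column, so each printed
% line holds one configuration: current, smaller disp, diff, %increase.
fprintf('%6.2f %6.2f %6.2f %6.2f\n', [mpgPair; gain; pctGain]);
```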

## Conclusion

We can now use this model for simulating different scenarios to come up with recommendations for a new automobile design.

Restore the warning state saved during setup:

```
warning(warnState);
```