Solving Data Management and Analysis Challenges using MATLAB and Statistics Toolbox
Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).
In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.
Contents
Automobile Data
Now let's begin. We'll work with this MAT file which contains some automobile data.
clear;
load carbig
whos
Name Size Bytes Class Attributes Acceleration 406x1 3248 double Cylinders 406x1 3248 double Displacement 406x1 3248 double Horsepower 406x1 3248 double MPG 406x1 3248 double Model 406x36 29232 char Model_Year 406x1 3248 double Origin 406x7 5684 char Weight 406x1 3248 double cyl4 406x5 4060 char org 406x7 5684 char when 406x5 4060 char
This data set contains information regarding 406 different cars. There are different variables for each piece of information, and each row corresponds to the same car.
Dataset Object
Dataset objects allow you to organize information in a tabular format, and have structures very much like that of matrices. Each row represents the observations, or "cars" in this case, and the columns represent the variables, with the appropriate header names.
clc;
cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ...
MPG, Model, Model_Year, Origin, Weight)
cars =
Acceleration Cylinders Displacement Horsepower MPG Model Model_Year Origin Weight
12 8 307 130 18 [1x36 char] 70 USA 3504
11.5 8 350 165 15 [1x36 char] 70 USA 3693
11 8 318 150 18 [1x36 char] 70 USA 3436
12 8 304 150 16 [1x36 char] 70 USA 3433
10.5 8 302 140 17 [1x36 char] 70 USA 3449
10 8 429 198 15 [1x36 char] 70 USA 4341
9 8 454 220 14 [1x36 char] 70 USA 4354
8.5 8 440 215 14 [1x36 char] 70 USA 4312
10 8 455 225 14 [1x36 char] 70 USA 4425
8.5 8 390 190 15 [1x36 char] 70 USA 3850
17.5 4 133 115 NaN [1x36 char] 70 France 3090
11.5 8 350 165 NaN [1x36 char] 70 USA 4142
11 8 351 153 NaN [1x36 char] 70 USA 4034
...The summary function provides basic statistical information for each of the variables included in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaNs.
clc summary(cars);
Acceleration: [406x1 double]
min 1st Q median 3rd Q max
8 13.7 15.5 17.2 24.8
Cylinders: [406x1 double]
min 1st Q median 3rd Q max
3 4 4 8 8
Displacement: [406x1 double]
min 1st Q median 3rd Q max
68 105 151 302 455
Horsepower: [406x1 double]
min 1st Q median 3rd Q max NaNs
46 75.5 95 130 230 6
MPG: [406x1 double]
min 1st Q median 3rd Q max NaNs
9 17.5 23 29 46.6 8
...We can index into a dataset object like a regular matrix.
clc cars(1:10, :)
ans =
Acceleration Cylinders Displacement Horsepower MPG Model Model_Year Origin Weight
12 8 307 130 18 [1x36 char] 70 USA 3504
11.5 8 350 165 15 [1x36 char] 70 USA 3693
11 8 318 150 18 [1x36 char] 70 USA 3436
12 8 304 150 16 [1x36 char] 70 USA 3433
10.5 8 302 140 17 [1x36 char] 70 USA 3449
10 8 429 198 15 [1x36 char] 70 USA 4341
9 8 454 220 14 [1x36 char] 70 USA 4354
8.5 8 440 215 14 [1x36 char] 70 USA 4312
10 8 455 225 14 [1x36 char] 70 USA 4425
8.5 8 390 190 15 [1x36 char] 70 USA 3850
We can access individual columns by referencing them by their names...
clc
cars(1:10, {'Origin', 'MPG', 'Weight'})
ans =
Origin MPG Weight
USA 18 3504
USA 15 3693
USA 18 3436
USA 16 3433
USA 17 3449
USA 15 4341
USA 14 4354
USA 14 4312
USA 14 4425
USA 15 3850
The dot-notation allows you to extract the whole content of a variable.
clc cars.Horsepower(1:10)
ans = 130 165 150 150 140 198 220 215 225 190
Dataset objects store meta information.
get(cars)
Description: ''
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {}
VarNames: {'Acceleration' 'Cylinders' 'Displacement' 'Horsepower' 'MPG' 'Model' 'Model_Year' 'Origin' 'Weight'}
We can add dataset descriptions as well as units for the variables.
clc cars = set(cars, 'Description', 'Performance and structural information of automobiles'); cars = set(cars, 'Units', {'m/s^2', '', 'mm', 'hp', 'mpg', '', '', '', 'kg'}); summary(cars)
Performance and structural information of automobiles
Acceleration: [406x1 double, Units = m/s^2]
min 1st Q median 3rd Q max
8 13.7 15.5 17.2 24.8
Cylinders: [406x1 double]
min 1st Q median 3rd Q max
3 4 4 8 8
Displacement: [406x1 double, Units = mm]
min 1st Q median 3rd Q max
68 105 151 302 455
Horsepower: [406x1 double, Units = hp]
min 1st Q median 3rd Q max NaNs
46 75.5 95 130 230 6
MPG: [406x1 double, Units = mpg]
...Categorical Arrays
Notice that some of the variables take on discrete values. For instance, the Cylinders, and Origin take on a unique set of values:
clc disp('Cylinders:'); unique(cars(:, 'Cylinders')) disp('Origin:'); unique(cars(:, 'Origin'))
Cylinders:
ans =
Cylinders
3
4
5
6
8
Origin:
ans =
Origin
England
France
Germany
Italy
...Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.
clc Cylinders_cat = ordinal(Cylinders); Origin_cat = nominal(Origin); whos Cylinders* Origin*
Name Size Bytes Class Attributes Cylinders 406x1 3248 double Cylinders_cat 406x1 1178 ordinal Origin 406x7 5684 char Origin_cat 406x1 1366 nominal
Now, let's convert the variables of the dataset object.
cars.Cylinders = ordinal(cars.Cylinders); cars.Origin = nominal(cars.Origin);
Filtering
Dataset objects can be easily filtered by criteria.
For example, we can create a logical array that has ONEs where the origin is Germany and ZEROs where it's not Germany.
germanyMask = cars.Origin == 'Germany'
germanyMask =
0
0
0
0
0
0
0
0
0
0
0
0
0
0
...Use mask to extract out all German cars.
clc cars(germanyMask, :)
ans =
Acceleration Cylinders Displacement Horsepower MPG Model Model_Year Origin Weight
20.5 4 97 46 26 [1x36 char] 70 Germany 1835
14.5 4 107 90 24 [1x36 char] 70 Germany 2430
12.5 4 121 113 26 [1x36 char] 70 Germany 2234
20 4 97 48 NaN [1x36 char] 71 Germany 1978
14 4 116 90 28 [1x36 char] 71 Germany 2123
19 4 97 60 27 [1x36 char] 71 Germany 1834
23.5 4 97 54 23 [1x36 char] 72 Germany 2254
18 4 121 76 22 [1x36 char] 72 Germany 2511
21 4 97 46 26 [1x36 char] 73 Germany 1950
15.5 4 116 75 24 [1x36 char] 73 Germany 2158
14 4 114 91 20 [1x36 char] 73 Germany 2582
16.5 4 98 83 29 [1x36 char] 74 Germany 2219
15.5 4 79 67 26 [1x36 char] 74 Germany 1963
...Scatter plot grouped by the year of the make.
gscatter(cars.MPG, cars.Weight, cars.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight')
We notice a general trend, but the amount of data prevents us from getting useful information.
We can use filtering to refine the visualization. Let's extract out only the cars made in 1970, 1976, or 1982.
index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82; filtered = cars(index,:);
We have a more meaningful scatter plot for this smaller subset.
gscatter(filtered.MPG, filtered.Weight, filtered.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight')
Add interactive case names to the plot
gname(filtered.Model);
Concatenate and Join
We can combine datasets by either concatenating or joining.
Concatenate
We have a different set of data that corresponds to small cars. Let's combine this with the original dataset. First, we'll create a dataset object from this data.
% load carsmall cs = load('carsmall.mat'); % create dataset and convert variables to categorical arrays cars_s = dataset(cs); cars_s.Origin = nominal(cars_s.Origin); cars_s.Cylinders = ordinal(cars_s.Cylinders); % add additional levels match the levels from the carbig dataset cars_s.Cylinders = addlevels(cars_s.Cylinders, getlabels(cars.Cylinders)); cars_s.Cylinders = reorderlevels(cars_s.Cylinders, getlabels(cars.Cylinders));
Warning: Ignoring duplicate levels in NEWLEVELS.
Concatenate using the matrix concatenation notation.
cars_all = [cars; cars_s]; % alternatively, % cars_all = vertcat(cars, cars_s);
Join
Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.
clc tabulate(cars_all.Origin);
Value Count Percent
England 1 0.20%
France 18 3.56%
Germany 48 9.49%
Italy 9 1.78%
Japan 94 18.58%
Sweden 13 2.57%
USA 323 63.83%
Create a new dataset that maps countries to continents.
clc Newdata = dataset(... {nominal({'England';'France';'Germany';'Italy' ;'Japan';'Sweden';'USA' }),'Origin' }, ... {nominal({'Europe' ;'Europe';'Europe' ;'Europe';'Asia' ;'Europe';'North America'}),'Continent'})
Newdata =
Origin Continent
England Europe
France Europe
Germany Europe
Italy Europe
Japan Asia
Sweden Europe
USA North America
Join the two datasets to include Continent as a new variable.
cars_all = join(cars_all, Newdata); clc cars_all(1:10:100, :)
ans =
Acceleration Cylinders Displacement Horsepower MPG Model Model_Year Origin Weight Continent
12 8 307 130 18 [1x36 char] 70 USA 3504 North America
17.5 4 133 115 NaN [1x36 char] 70 France 3090 Europe
15 4 113 95 24 [1x36 char] 70 Japan 2372 Asia
15 6 199 90 21 [1x36 char] 70 USA 2648 North America
13 6 232 100 19 [1x36 char] 71 USA 2634 North America
12 8 400 170 13 [1x36 char] 71 USA 4746 North America
19 4 71 65 31 [1x36 char] 71 Japan 1773 Asia
12 8 400 175 14 [1x36 char] 72 USA 4385 North America
14 8 307 130 13 [1x36 char] 72 USA 4098 North America
15 4 98 80 28 [1x36 char] 72 USA 2164 North America
Dealing with Missing Data
Notice that we have some missing data in our MPG data
clc
cars(5:20, 'MPG')
ans =
MPG
17
15
14
14
14
15
NaN
NaN
NaN
NaN
NaN
15
14
...One way to deal with missing data is to substitute for the missing value. In this case, we will create a regression model to represent the performance measures (MPG) as functions of possible predictor variables (acceleration, cylinders, horsepower, model year, and weight)
X = [ones(length(cars.MPG),1) cars.Acceleration, double(cars.Cylinders), ...
cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight];
Y = [cars.MPG];
[b,bint,r,rint,stats] = regress(Y, X);
Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion on the independent variables
cars.regress = X * b;
fprintf('R-squared: %f\n', stats(1));
R-squared: 0.814178
Examine the residual.
residuals = cars.MPG - cars.regress; stem(cars.regress, residuals); xlabel('model'); ylabel('actual - model');
For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.
gname(cars.Model)
Clean Up Our Model
We can potentially improve the model by adding dummy variables to handle diesels, automatic transmissions, and station wagons. In addition, we can filter out 3 and 5 cylinder engines, which are rotary engines.
Dummy variable is a binary variable that has a "1" where it satisfies the criteria and "0" everywhere else.
% Load in as a dataset object ds = dataset('xlsfile','dummy.xls'); % Concatenate carsall = [cars, ds]; carsall = set(carsall, 'Units', [get(carsall, 'Units'), {'', '', '', ''}]); % Filter out 3- and 5- cylinder engines index = carsall.Cylinders == '4' | carsall.Cylinders == '6' | carsall.Cylinders == '8'; carsall = carsall(index,:);
Create a New Regression Model
Create a new regression model just by looking at 4, 6, and 8 cylinders and taking into account the car type (station wagon, diesel, automatic).
X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model)
Multivariate Analysis of Variance
Multivariate analysis of variance to see how similar the cars from various countries are, in terms of MPG, Acceleration, Weight, and Displacement.
X = [carsall.MPG, carsall.Acceleration, carsall.Weight, carsall.Displacement]; [d, p, stats] = manova1(X, carsall.Origin); manovacluster(stats)
We see that Japanese and German cars are quite similar, and they are very different from English and American cars
Let's add another dummy variable that distinguished Japanese and German cars. Then redo the regression.
carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan'); X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic carsall.dummy]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; % Inspect once again residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model)
Robust Regression
We can also perform robust regression to deal with the outliers that may exist in the dataset.
X2 = [carsall.Acceleration, double(carsall.Cylinders), ... carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ... carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy]; [robustbeta, stats] = robustfit(X2, Y) X3 = [ones(length(carsall.MPG),1), X2]; carsall.regress2 = X3 * robustbeta;
robustbeta =
-4.1872
-0.20859
-1.6006
0.012578
-0.016641
0.64442
-0.0048378
0.11671
12.472
-2.6949
1.7508
stats =
ols_s: 2.991
robust_s: 2.8012
...Perform Regression Substitution
We have been looking at linear regressions so far, but we might be able to apply some nonlinear regressions to get a better predictive model.
Once we have a regression model, we can go ahead substitute the missing values with the model data.
carsall.mask = isnan(carsall.MPG);
carsall.MPG(carsall.mask) = carsall.regress2(carsall.mask);
carsall(5:20, 'MPG')
ans =
MPG
17
15
14
14
14
15
18.881
12.256
13.095
12.597
13.732
15
14
...