Solving Data Management and Analysis Challenges using MATLAB and Statistics Toolbox

Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).

In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.

Contents

Automobile Data

Now let's begin. We'll work with this MAT file which contains some automobile data.

clear;
load carbig
whos
  Name                Size            Bytes  Class     Attributes

  Acceleration      406x1              3248  double              
  Cylinders         406x1              3248  double              
  Displacement      406x1              3248  double              
  Horsepower        406x1              3248  double              
  MPG               406x1              3248  double              
  Model             406x36            29232  char                
  Model_Year        406x1              3248  double              
  Origin            406x7              5684  char                
  Weight            406x1              3248  double              
  cyl4              406x5              4060  char                
  org               406x7              5684  char                
  when              406x5              4060  char                

This data set contains information regarding 406 different cars. There are different variables for each piece of information, and each row corresponds to the same car.

Dataset Object

Dataset objects allow you to organize information in a tabular format, and have structures very much like that of matrices. Each row represents the observations, or "cars" in this case, and the columns represent the variables, with the appropriate header names.

clc;
cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ...
  MPG, Model, Model_Year, Origin, Weight)
cars = 
    Acceleration    Cylinders    Displacement    Horsepower    MPG     Model          Model_Year    Origin     Weight
      12            8             307            130             18    [1x36 char]    70            USA        3504  
    11.5            8             350            165             15    [1x36 char]    70            USA        3693  
      11            8             318            150             18    [1x36 char]    70            USA        3436  
      12            8             304            150             16    [1x36 char]    70            USA        3433  
    10.5            8             302            140             17    [1x36 char]    70            USA        3449  
      10            8             429            198             15    [1x36 char]    70            USA        4341  
       9            8             454            220             14    [1x36 char]    70            USA        4354  
     8.5            8             440            215             14    [1x36 char]    70            USA        4312  
      10            8             455            225             14    [1x36 char]    70            USA        4425  
     8.5            8             390            190             15    [1x36 char]    70            USA        3850  
    17.5            4             133            115            NaN    [1x36 char]    70            France     3090  
    11.5            8             350            165            NaN    [1x36 char]    70            USA        4142  
      11            8             351            153            NaN    [1x36 char]    70            USA        4034  
...

The summary function provides basic statistical information for each of the variables included in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaNs.

clc
summary(cars);
Acceleration: [406x1 double]
     min      1st Q      median      3rd Q      max  
       8       13.7        15.5       17.2      24.8 
Cylinders: [406x1 double]
     min      1st Q      median      3rd Q      max 
       3          4           4          8        8 
Displacement: [406x1 double]
     min      1st Q      median      3rd Q      max 
      68        105         151        302      455 
Horsepower: [406x1 double]
     min      1st Q      median      3rd Q      max      NaNs 
      46       75.5          95        130      230         6 
MPG: [406x1 double]
     min      1st Q      median      3rd Q      max       NaNs 
       9       17.5          23         29      46.6         8 
...

We can index into a dataset object like a regular matrix.

clc
cars(1:10, :)
ans = 
    Acceleration    Cylinders    Displacement    Horsepower    MPG    Model          Model_Year    Origin     Weight
      12            8            307             130           18     [1x36 char]    70            USA        3504  
    11.5            8            350             165           15     [1x36 char]    70            USA        3693  
      11            8            318             150           18     [1x36 char]    70            USA        3436  
      12            8            304             150           16     [1x36 char]    70            USA        3433  
    10.5            8            302             140           17     [1x36 char]    70            USA        3449  
      10            8            429             198           15     [1x36 char]    70            USA        4341  
       9            8            454             220           14     [1x36 char]    70            USA        4354  
     8.5            8            440             215           14     [1x36 char]    70            USA        4312  
      10            8            455             225           14     [1x36 char]    70            USA        4425  
     8.5            8            390             190           15     [1x36 char]    70            USA        3850  

We can access individual columns by referencing them by their names...

clc
cars(1:10, {'Origin', 'MPG', 'Weight'})
ans = 
    Origin     MPG    Weight
    USA        18     3504  
    USA        15     3693  
    USA        18     3436  
    USA        16     3433  
    USA        17     3449  
    USA        15     4341  
    USA        14     4354  
    USA        14     4312  
    USA        14     4425  
    USA        15     3850  

The dot-notation allows you to extract the whole content of a variable.

clc
cars.Horsepower(1:10)
ans =
   130
   165
   150
   150
   140
   198
   220
   215
   225
   190

Dataset objects store meta information.

get(cars)
    Description: ''
          Units: {}
       DimNames: {'Observations'  'Variables'}
       UserData: []
       ObsNames: {}
       VarNames: {'Acceleration'  'Cylinders'  'Displacement'  'Horsepower'  'MPG'  'Model'  'Model_Year'  'Origin'  'Weight'}

We can add dataset descriptions as well as units for the variables.

clc
cars = set(cars, 'Description', 'Performance and structural information of automobiles');
cars = set(cars, 'Units', {'m/s^2', '', 'mm', 'hp', 'mpg', '', '', '', 'kg'});

summary(cars)
Performance and structural information of automobiles

Acceleration: [406x1 double, Units = m/s^2]
     min      1st Q      median      3rd Q      max  
       8       13.7        15.5       17.2      24.8 
Cylinders: [406x1 double]
     min      1st Q      median      3rd Q      max 
       3          4           4          8        8 
Displacement: [406x1 double, Units = mm]
     min      1st Q      median      3rd Q      max 
      68        105         151        302      455 
Horsepower: [406x1 double, Units = hp]
     min      1st Q      median      3rd Q      max      NaNs 
      46       75.5          95        130      230         6 
MPG: [406x1 double, Units = mpg]
...

Categorical Arrays

Notice that some of the variables take on discrete values. For instance, the Cylinders, and Origin take on a unique set of values:

clc
disp('Cylinders:');
unique(cars(:, 'Cylinders'))

disp('Origin:');
unique(cars(:, 'Origin'))
Cylinders:
ans = 
    Cylinders
    3        
    4        
    5        
    6        
    8        
Origin:
ans = 
    Origin 
    England
    France 
    Germany
    Italy  
...

Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.

clc
Cylinders_cat = ordinal(Cylinders);
Origin_cat = nominal(Origin);

whos Cylinders* Origin*
  Name                 Size            Bytes  Class      Attributes

  Cylinders          406x1              3248  double               
  Cylinders_cat      406x1              1178  ordinal              
  Origin             406x7              5684  char                 
  Origin_cat         406x1              1366  nominal              

Now, let's convert the variables of the dataset object.

cars.Cylinders = ordinal(cars.Cylinders);
cars.Origin = nominal(cars.Origin);

Filtering

Dataset objects can be easily filtered by criteria.

For example, we can create a logical array that has ONEs where the origin is Germany and ZEROs where it's not Germany.

germanyMask = cars.Origin == 'Germany'
germanyMask =
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
...

Use mask to extract out all German cars.

clc
cars(germanyMask, :)
ans = 
    Acceleration    Cylinders    Displacement    Horsepower    MPG     Model          Model_Year    Origin     Weight
    20.5            4             97              46             26    [1x36 char]    70            Germany    1835  
    14.5            4            107              90             24    [1x36 char]    70            Germany    2430  
    12.5            4            121             113             26    [1x36 char]    70            Germany    2234  
      20            4             97              48            NaN    [1x36 char]    71            Germany    1978  
      14            4            116              90             28    [1x36 char]    71            Germany    2123  
      19            4             97              60             27    [1x36 char]    71            Germany    1834  
    23.5            4             97              54             23    [1x36 char]    72            Germany    2254  
      18            4            121              76             22    [1x36 char]    72            Germany    2511  
      21            4             97              46             26    [1x36 char]    73            Germany    1950  
    15.5            4            116              75             24    [1x36 char]    73            Germany    2158  
      14            4            114              91             20    [1x36 char]    73            Germany    2582  
    16.5            4             98              83             29    [1x36 char]    74            Germany    2219  
    15.5            4             79              67             26    [1x36 char]    74            Germany    1963  
...

Scatter plot grouped by the year of the make.

gscatter(cars.MPG, cars.Weight, cars.Model_Year, '', 'xos');
xlabel('Miles per Gallon')
ylabel('Weight')

We notice a general trend, but the amount of data prevents us from getting useful information.

We can use filtering to refine the visualization. Let's extract out only the cars made in 1970, 1976, or 1982.

index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82;
filtered = cars(index,:);

We have a more meaningful scatter plot for this smaller subset.

gscatter(filtered.MPG, filtered.Weight, filtered.Model_Year, '', 'xos');
xlabel('Miles per Gallon')
ylabel('Weight')

Add interactive case names to the plot

gname(filtered.Model);

Concatenate and Join

We can combine datasets by either concatenating or joining.

Concatenate

We have a different set of data that corresponds to small cars. Let's combine this with the original dataset. First, we'll create a dataset object from this data.

% load carsmall
cs = load('carsmall.mat');

% create dataset and convert variables to categorical arrays
cars_s = dataset(cs);
cars_s.Origin = nominal(cars_s.Origin);
cars_s.Cylinders = ordinal(cars_s.Cylinders);

% add additional levels match the levels from the carbig dataset
cars_s.Cylinders = addlevels(cars_s.Cylinders, getlabels(cars.Cylinders));
cars_s.Cylinders = reorderlevels(cars_s.Cylinders, getlabels(cars.Cylinders));
Warning: Ignoring duplicate levels in NEWLEVELS. 

Concatenate using the matrix concatenation notation.

cars_all = [cars; cars_s];
% alternatively,
% cars_all = vertcat(cars, cars_s);

Join

Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.

clc
tabulate(cars_all.Origin);
    Value    Count   Percent
  England        1      0.20%
   France       18      3.56%
  Germany       48      9.49%
    Italy        9      1.78%
    Japan       94     18.58%
   Sweden       13      2.57%
      USA      323     63.83%

Create a new dataset that maps countries to continents.

clc
Newdata = dataset(...
  {nominal({'England';'France';'Germany';'Italy' ;'Japan';'Sweden';'USA'          }),'Origin'   }, ...
  {nominal({'Europe' ;'Europe';'Europe' ;'Europe';'Asia' ;'Europe';'North America'}),'Continent'})
Newdata = 
    Origin     Continent    
    England    Europe       
    France     Europe       
    Germany    Europe       
    Italy      Europe       
    Japan      Asia         
    Sweden     Europe       
    USA        North America

Join the two datasets to include Continent as a new variable.

cars_all = join(cars_all, Newdata);

clc
cars_all(1:10:100, :)
ans = 
    Acceleration    Cylinders    Displacement    Horsepower    MPG    Model          Model_Year    Origin    Weight    Continent    
      12            8            307             130            18    [1x36 char]    70            USA       3504      North America
    17.5            4            133             115           NaN    [1x36 char]    70            France    3090      Europe       
      15            4            113              95            24    [1x36 char]    70            Japan     2372      Asia         
      15            6            199              90            21    [1x36 char]    70            USA       2648      North America
      13            6            232             100            19    [1x36 char]    71            USA       2634      North America
      12            8            400             170            13    [1x36 char]    71            USA       4746      North America
      19            4             71              65            31    [1x36 char]    71            Japan     1773      Asia         
      12            8            400             175            14    [1x36 char]    72            USA       4385      North America
      14            8            307             130            13    [1x36 char]    72            USA       4098      North America
      15            4             98              80            28    [1x36 char]    72            USA       2164      North America

Dealing with Missing Data

Notice that we have some missing data in our MPG data

clc
cars(5:20, 'MPG')
ans = 
    MPG
     17
     15
     14
     14
     14
     15
    NaN
    NaN
    NaN
    NaN
    NaN
     15
     14
...

One way to deal with missing data is to substitute for the missing value. In this case, we will create a regression model to represent the performance measures (MPG) as functions of possible predictor variables (acceleration, cylinders, horsepower, model year, and weight)

X = [ones(length(cars.MPG),1) cars.Acceleration, double(cars.Cylinders), ...
  cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight];
Y = [cars.MPG];
[b,bint,r,rint,stats] = regress(Y, X);

Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion on the independent variables

cars.regress = X * b;
fprintf('R-squared: %f\n', stats(1));
R-squared: 0.814178

Examine the residual.

residuals = cars.MPG - cars.regress;
stem(cars.regress, residuals);
xlabel('model'); ylabel('actual - model');

For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.

gname(cars.Model)

Clean Up Our Model

We can potentially improve the model by adding dummy variables to handle diesels, automatic transmissions, and station wagons. In addition, we can filter out 3 and 5 cylinder engines, which are rotary engines.

Dummy variable is a binary variable that has a "1" where it satisfies the criteria and "0" everywhere else.

% Load in as a dataset object
ds = dataset('xlsfile','dummy.xls');

% Concatenate
carsall = [cars, ds];
carsall = set(carsall, 'Units', [get(carsall, 'Units'), {'', '', '', ''}]);

% Filter out 3- and 5- cylinder engines
index = carsall.Cylinders == '4' | carsall.Cylinders == '6' | carsall.Cylinders == '8';
carsall = carsall(index,:);

Create a New Regression Model

Create a new regression model just by looking at 4, 6, and 8 cylinders and taking into account the car type (station wagon, diesel, automatic).

X = [ones(length(carsall.MPG),1), carsall.Acceleration, ...
  double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ...
  carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ...
  carsall.Automatic];
Y = [carsall.MPG];
[b, bint, r, rint, stats] = regress(Y, X);

carsall.regress = X * b;

residuals2 = carsall.MPG - carsall.regress;
stem(carsall.regress, residuals2)
xlabel('model'); ylabel('actual - model');
gname(carsall.Model)

Multivariate Analysis of Variance

Multivariate analysis of variance to see how similar the cars from various countries are, in terms of MPG, Acceleration, Weight, and Displacement.

X = [carsall.MPG, carsall.Acceleration, carsall.Weight, carsall.Displacement];
[d, p, stats] = manova1(X, carsall.Origin);
manovacluster(stats)

We see that Japanese and German cars are quite similar, and they are very different from English and American cars

Let's add another dummy variable that distinguished Japanese and German cars. Then redo the regression.

carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan');

X = [ones(length(carsall.MPG),1), carsall.Acceleration, ...
  double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ...
  carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ...
  carsall.Automatic carsall.dummy];
Y = [carsall.MPG];
[b, bint, r, rint, stats] = regress(Y, X);

carsall.regress = X * b;

% Inspect once again
residuals2 = carsall.MPG - carsall.regress;
stem(carsall.regress, residuals2)
xlabel('model'); ylabel('actual - model');
gname(carsall.Model)

Robust Regression

We can also perform robust regression to deal with the outliers that may exist in the dataset.

X2 = [carsall.Acceleration, double(carsall.Cylinders), ...
  carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ...
  carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy];
[robustbeta, stats] = robustfit(X2, Y)
X3 = [ones(length(carsall.MPG),1), X2];

carsall.regress2 = X3 * robustbeta;
robustbeta =
      -4.1872
     -0.20859
      -1.6006
     0.012578
    -0.016641
      0.64442
   -0.0048378
      0.11671
       12.472
      -2.6949
       1.7508
stats = 
        ols_s: 2.991
     robust_s: 2.8012
...

Perform Regression Substitution

We have been looking at linear regressions so far, but we might be able to apply some nonlinear regressions to get a better predictive model.

Once we have a regression model, we can go ahead substitute the missing values with the model data.

carsall.mask = isnan(carsall.MPG);
carsall.MPG(carsall.mask) = carsall.regress2(carsall.mask);

carsall(5:20, 'MPG')
ans = 
    MPG   
        17
        15
        14
        14
        14
        15
    18.881
    12.256
    13.095
    12.597
    13.732
        15
        14
...