Solving Data Management and Analysis Challenges using MATLAB® and Statistics Toolbox™
Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).
In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.
Contents
Automobile Data
Now let's begin. We'll work with this MAT file which contains some automobile data.
clear;
load carbig
whos
Name Size Bytes Class Attributes Acceleration 406x1 3248 double Cylinders 406x1 3248 double Displacement 406x1 3248 double Horsepower 406x1 3248 double MPG 406x1 3248 double Model 406x36 29232 char Model_Year 406x1 3248 double Origin 406x7 5684 char Weight 406x1 3248 double cyl4 406x5 4060 char org 406x7 5684 char when 406x5 4060 char
This data set contains information regarding 406 different cars. There are different variables for each piece of information, and each row corresponds to the same car.
Dataset Object
Dataset objects allow you to organize information in a tabular format, and have structures very much like that of matrices. Each row represents the observations, or "cars" in this case, and the columns represent the variables, with the appropriate header names.
clc;
cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ...
MPG, Model, Model_Year, Origin, Weight)
cars =
Acceleration Cylinders Displacement Horsepower
12 8 307 130
11.5 8 350 165
11 8 318 150
12 8 304 150
10.5 8 302 140
10 8 429 198
9 8 454 220
8.5 8 440 215
10 8 455 225
8.5 8 390 190
17.5 4 133 115
11.5 8 350 165
11 8 351 153
10.5 8 383 175
11 8 360 175
10 8 383 170
8 8 340 160
8 8 302 140
9.5 8 400 150
10 8 455 225
15 4 113 95
15.5 6 198 95
15.5 6 199 97
16 6 200 85
14.5 4 97 88
20.5 4 97 46
17.5 4 110 87
14.5 4 107 90
17.5 4 104 95
12.5 4 121 113
15 6 199 90
14 8 360 215
15 8 307 200
13.5 8 318 210
18.5 8 304 193
14.5 4 97 88
15.5 4 140 90
14 4 113 95
19 4 98 NaN
20 4 97 48
13 6 232 100
15.5 6 225 105
15.5 6 250 100
15.5 6 250 88
15.5 6 232 100
12 8 350 165
11.5 8 400 175
13.5 8 351 153
13 8 318 150
11.5 8 383 180
12 8 400 170
12 8 400 175
13.5 6 258 110
19 4 140 72
15 6 250 100
14.5 6 250 88
14 4 122 86
14 4 116 90
19.5 4 79 70
14.5 4 88 76
19 4 71 65
18 4 72 69
19 4 97 60
20.5 4 91 70
15.5 4 113 95
17 4 97.5 80
23.5 4 97 54
19.5 4 140 90
16.5 4 122 86
12 8 350 165
12 8 400 175
13.5 8 318 150
13 8 351 153
11.5 8 304 150
11 8 429 208
13.5 8 350 155
13.5 8 350 160
12.5 8 400 190
13.5 3 70 97
12.5 8 304 150
14 8 307 130
16 8 302 140
14 8 318 150
14.5 4 121 112
18 4 121 76
19.5 4 120 87
18 4 96 69
16 4 122 86
17 4 97 92
14.5 4 120 97
15 4 98 80
16.5 4 97 88
13 8 350 175
11.5 8 304 150
13 8 350 145
14.5 8 302 137
12.5 8 318 150
11.5 8 429 198
12 8 400 150
13 8 351 158
14.5 8 318 150
11 8 440 215
11 8 455 225
11 8 360 175
16.5 6 225 105
18 6 250 100
16 6 232 100
16.5 6 250 88
16 6 198 95
21 4 97 46
14 8 400 150
12.5 8 400 167
13 8 360 170
12.5 8 350 180
15 6 232 100
19 4 97 88
19.5 4 140 72
16.5 4 108 94
13.5 3 70 90
18.5 4 122 85
14 6 155 107
15.5 4 98 90
13 8 350 145
9.5 8 400 230
19.5 4 68 49
15.5 4 116 75
14 4 114 91
15.5 4 121 112
11 8 318 150
14 4 121 110
13.5 6 156 122
11 8 350 180
16.5 6 198 95
17 6 200 NaN
16 6 232 100
17 6 250 100
19 4 79 67
16.5 4 122 80
21 4 71 65
17 4 140 75
17 6 250 100
18 6 258 110
16.5 6 225 105
14 8 302 140
14.5 8 350 150
13.5 8 318 150
16 8 302 140
15.5 8 304 150
16.5 4 98 83
15.5 4 79 67
14.5 4 97 78
16.5 4 76 52
19 4 83 61
14.5 4 90 75
15.5 4 90 75
14 4 116 75
15 4 120 97
15.5 4 108 93
16 4 79 67
16 6 225 95
16 6 250 105
21 6 250 72
19.5 6 250 72
11.5 8 400 170
14 8 350 145
14.5 8 318 150
13.5 8 351 148
21 6 231 110
18.5 6 250 105
19 6 258 110
19 6 225 95
15 6 231 110
13.5 8 262 110
12 8 302 129
16 4 97 75
17 4 140 83
16 6 232 100
18.5 4 140 78
13.5 4 134 96
16.5 4 90 71
17 4 119 97
14.5 6 171 97
14 4 90 70
17 6 232 90
15 4 115 95
17 4 120 88
14.5 4 121 98
13.5 4 121 115
17.5 4 91 53
15.5 4 107 86
16.9 4 116 81
14.9 4 140 92
17.7 4 98 79
15.3 4 101 83
13 8 305 140
13 8 318 150
13.9 8 304 120
12.8 8 351 152
15.4 6 225 100
14.5 6 250 105
17.6 6 200 81
17.6 6 232 90
22.2 4 85 52
22.1 4 98 60
14.2 4 90 70
17.4 4 91 53
17.7 6 225 100
21 6 250 78
16.2 6 250 110
17.8 6 258 95
12.2 4 97 71
17 4 85 70
16.4 4 97 75
13.6 4 140 72
15.7 4 130 102
13.2 8 318 150
21.9 4 120 88
15.5 6 156 108
16.7 6 168 120
12.1 8 350 180
12 8 350 145
15 8 302 130
14 8 318 150
18.5 4 98 68
14.8 4 111 80
18.6 4 79 58
15.5 4 122 96
16.8 4 85 70
12.5 8 305 145
19 8 260 110
13.7 8 318 145
14.9 8 302 130
16.4 6 250 110
16.9 6 231 105
17.7 6 225 100
19 6 250 98
11.1 8 400 180
11.4 8 350 170
12.2 8 400 190
14.5 8 351 149
14.5 4 97 78
16 4 151 88
18.2 4 97 75
15.8 4 140 89
17 4 98 63
15.9 4 98 83
16.4 4 97 67
14.1 4 97 78
14.5 6 146 97
12.8 4 121 110
13.5 3 80 110
21.5 4 90 48
14.4 4 98 66
19.4 4 78 52
18.6 4 85 70
16.4 4 91 60
15.5 8 260 110
13.2 8 318 140
12.8 8 302 139
19.2 6 231 105
18.2 6 200 95
15.8 6 200 85
15.4 4 140 88
17.2 6 225 100
17.2 6 232 90
15.8 6 231 105
16.7 6 200 85
18.7 6 225 110
15.1 6 258 120
13.2 8 305 145
13.4 6 231 165
11.2 8 302 139
13.7 8 318 140
16.5 4 98 68
14.2 4 134 95
14.7 4 119 97
14.5 4 105 75
14.8 4 134 95
16.7 4 156 105
17.6 4 151 85
14.9 4 119 97
15.9 5 131 103
13.6 6 163 125
15.7 4 121 115
15.8 6 163 133
14.9 4 89 71
16.6 4 98 68
15.4 6 231 115
18.2 6 200 85
17.3 4 140 88
18.2 6 232 90
16.6 6 225 110
15.4 8 305 130
13.4 8 302 129
13.2 8 351 138
15.2 8 318 135
14.9 8 350 155
14.3 8 351 142
15 8 267 125
13 8 360 150
14 4 89 71
15.2 4 86 65
14.4 4 98 80
15 4 121 80
20.1 5 183 77
17.4 8 350 125
24.8 4 141 71
22.2 8 260 90
13.2 4 105 70
14.9 4 105 70
19.2 4 85 65
14.7 4 91 69
16 4 151 90
11.3 6 173 115
12.9 6 173 115
13.2 4 151 90
14.7 4 98 76
18.8 4 89 60
15.5 4 98 70
16.4 4 86 65
16.5 4 151 90
18.1 4 140 88
20.1 4 151 90
18.7 6 225 90
15.8 4 97 78
15.5 4 134 90
17.5 4 120 75
15 4 119 92
15.2 4 108 75
17.9 4 86 65
14.4 4 156 105
19.2 4 85 65
21.7 4 90 48
23.7 4 90 48
19.9 5 121 67
21.8 4 146 67
13.8 4 91 67
17.3 4 85 NaN
18 4 97 67
15.3 4 89 62
11.4 6 168 132
12.5 3 70 100
15.1 4 122 88
14.3 4 140 NaN
17 4 107 72
15.7 4 135 84
16.4 4 151 84
14.4 4 156 92
12.6 6 173 110
12.9 4 135 84
16.9 4 79 58
16.4 4 86 64
16.1 4 81 60
17.8 4 97 67
19.4 4 85 65
17.3 4 89 62
16 4 91 68
14.9 4 105 63
16.2 4 98 65
20.7 4 98 65
14.2 4 105 74
15.8 4 100 NaN
14.4 4 107 75
16.8 4 108 75
14.8 4 119 100
18.3 4 120 74
20.4 4 141 80
15.4 4 121 110
19.6 6 145 76
12.6 6 168 116
13.8 6 146 120
15.8 6 231 110
19 8 350 105
17.1 6 200 88
16.6 6 225 85
19.6 4 112 88
18.6 4 112 88
18 4 112 88
16.2 4 112 85
16 4 135 84
18 4 151 90
16.4 4 140 92
20.5 4 151 NaN
15.3 4 105 74
18.2 4 91 68
17.6 4 91 68
14.7 4 105 63
17.3 4 98 70
14.5 4 120 88
14.5 4 107 75
16.9 4 108 70
15 4 91 67
15.7 4 91 67
16.2 4 91 67
16.4 6 181 110
17 6 262 85
14.5 4 156 92
14.7 6 232 112
13.9 4 144 96
13 4 135 84
17.3 4 151 90
15.6 4 140 86
24.6 4 97 52
11.6 4 135 84
18.6 4 120 79
19.4 4 119 82
MPG Model Model_Year Origin Weight
18 [1x36 char] 70 USA 3504
15 [1x36 char] 70 USA 3693
18 [1x36 char] 70 USA 3436
16 [1x36 char] 70 USA 3433
17 [1x36 char] 70 USA 3449
15 [1x36 char] 70 USA 4341
14 [1x36 char] 70 USA 4354
14 [1x36 char] 70 USA 4312
14 [1x36 char] 70 USA 4425
15 [1x36 char] 70 USA 3850
NaN [1x36 char] 70 France 3090
NaN [1x36 char] 70 USA 4142
NaN [1x36 char] 70 USA 4034
NaN [1x36 char] 70 USA 4166
NaN [1x36 char] 70 USA 3850
15 [1x36 char] 70 USA 3563
14 [1x36 char] 70 USA 3609
NaN [1x36 char] 70 USA 3353
15 [1x36 char] 70 USA 3761
14 [1x36 char] 70 USA 3086
24 [1x36 char] 70 Japan 2372
22 [1x36 char] 70 USA 2833
18 [1x36 char] 70 USA 2774
21 [1x36 char] 70 USA 2587
27 [1x36 char] 70 Japan 2130
26 [1x36 char] 70 Germany 1835
25 [1x36 char] 70 France 2672
24 [1x36 char] 70 Germany 2430
25 [1x36 char] 70 Sweden 2375
26 [1x36 char] 70 Germany 2234
21 [1x36 char] 70 USA 2648
10 [1x36 char] 70 USA 4615
10 [1x36 char] 70 USA 4376
11 [1x36 char] 70 USA 4382
9 [1x36 char] 70 USA 4732
27 [1x36 char] 71 Japan 2130
28 [1x36 char] 71 USA 2264
25 [1x36 char] 71 Japan 2228
25 [1x36 char] 71 USA 2046
NaN [1x36 char] 71 Germany 1978
19 [1x36 char] 71 USA 2634
16 [1x36 char] 71 USA 3439
17 [1x36 char] 71 USA 3329
19 [1x36 char] 71 USA 3302
18 [1x36 char] 71 USA 3288
14 [1x36 char] 71 USA 4209
14 [1x36 char] 71 USA 4464
14 [1x36 char] 71 USA 4154
14 [1x36 char] 71 USA 4096
12 [1x36 char] 71 USA 4955
13 [1x36 char] 71 USA 4746
13 [1x36 char] 71 USA 5140
18 [1x36 char] 71 USA 2962
22 [1x36 char] 71 USA 2408
19 [1x36 char] 71 USA 3282
18 [1x36 char] 71 USA 3139
23 [1x36 char] 71 USA 2220
28 [1x36 char] 71 Germany 2123
30 [1x36 char] 71 France 2074
30 [1x36 char] 71 Italy 2065
31 [1x36 char] 71 Japan 1773
35 [1x36 char] 71 Japan 1613
27 [1x36 char] 71 Germany 1834
26 [1x36 char] 71 USA 1955
24 [1x36 char] 72 Japan 2278
25 [1x36 char] 72 USA 2126
23 [1x36 char] 72 Germany 2254
20 [1x36 char] 72 USA 2408
21 [1x36 char] 72 USA 2226
13 [1x36 char] 72 USA 4274
14 [1x36 char] 72 USA 4385
15 [1x36 char] 72 USA 4135
14 [1x36 char] 72 USA 4129
17 [1x36 char] 72 USA 3672
11 [1x36 char] 72 USA 4633
13 [1x36 char] 72 USA 4502
12 [1x36 char] 72 USA 4456
13 [1x36 char] 72 USA 4422
19 [1x36 char] 72 Japan 2330
15 [1x36 char] 72 USA 3892
13 [1x36 char] 72 USA 4098
13 [1x36 char] 72 USA 4294
14 [1x36 char] 72 USA 4077
18 [1x36 char] 72 Sweden 2933
22 [1x36 char] 72 Germany 2511
21 [1x36 char] 72 France 2979
26 [1x36 char] 72 France 2189
22 [1x36 char] 72 USA 2395
28 [1x36 char] 72 Japan 2288
23 [1x36 char] 72 Japan 2506
28 [1x36 char] 72 USA 2164
27 [1x36 char] 72 Japan 2100
13 [1x36 char] 73 USA 4100
14 [1x36 char] 73 USA 3672
13 [1x36 char] 73 USA 3988
14 [1x36 char] 73 USA 4042
15 [1x36 char] 73 USA 3777
12 [1x36 char] 73 USA 4952
13 [1x36 char] 73 USA 4464
13 [1x36 char] 73 USA 4363
14 [1x36 char] 73 USA 4237
13 [1x36 char] 73 USA 4735
12 [1x36 char] 73 USA 4951
13 [1x36 char] 73 USA 3821
18 [1x36 char] 73 USA 3121
16 [1x36 char] 73 USA 3278
18 [1x36 char] 73 USA 2945
18 [1x36 char] 73 USA 3021
23 [1x36 char] 73 USA 2904
26 [1x36 char] 73 Germany 1950
11 [1x36 char] 73 USA 4997
12 [1x36 char] 73 USA 4906
13 [1x36 char] 73 USA 4654
12 [1x36 char] 73 USA 4499
18 [1x36 char] 73 USA 2789
20 [1x36 char] 73 Japan 2279
21 [1x36 char] 73 USA 2401
22 [1x36 char] 73 Japan 2379
18 [1x36 char] 73 Japan 2124
19 [1x36 char] 73 USA 2310
21 [1x36 char] 73 USA 2472
26 [1x36 char] 73 Italy 2265
15 [1x36 char] 73 USA 4082
16 [1x36 char] 73 USA 4278
29 [1x36 char] 73 Italy 1867
24 [1x36 char] 73 Germany 2158
20 [1x36 char] 73 Germany 2582
19 [1x36 char] 73 Sweden 2868
15 [1x36 char] 73 USA 3399
24 [1x36 char] 73 Sweden 2660
20 [1x36 char] 73 Japan 2807
11 [1x36 char] 73 USA 3664
20 [1x36 char] 74 USA 3102
21 [1x36 char] 74 USA 2875
19 [1x36 char] 74 USA 2901
15 [1x36 char] 74 USA 3336
31 [1x36 char] 74 Japan 1950
26 [1x36 char] 74 USA 2451
32 [1x36 char] 74 Japan 1836
25 [1x36 char] 74 USA 2542
16 [1x36 char] 74 USA 3781
16 [1x36 char] 74 USA 3632
18 [1x36 char] 74 USA 3613
16 [1x36 char] 74 USA 4141
13 [1x36 char] 74 USA 4699
14 [1x36 char] 74 USA 4457
14 [1x36 char] 74 USA 4638
14 [1x36 char] 74 USA 4257
29 [1x36 char] 74 Germany 2219
26 [1x36 char] 74 Germany 1963
26 [1x36 char] 74 Germany 2300
31 [1x36 char] 74 Japan 1649
32 [1x36 char] 74 Japan 2003
28 [1x36 char] 74 USA 2125
24 [1x36 char] 74 Italy 2108
26 [1x36 char] 74 Italy 2246
24 [1x36 char] 74 Japan 2489
26 [1x36 char] 74 Japan 2391
31 [1x36 char] 74 Italy 2000
19 [1x36 char] 75 USA 3264
18 [1x36 char] 75 USA 3459
15 [1x36 char] 75 USA 3432
15 [1x36 char] 75 USA 3158
16 [1x36 char] 75 USA 4668
15 [1x36 char] 75 USA 4440
16 [1x36 char] 75 USA 4498
14 [1x36 char] 75 USA 4657
17 [1x36 char] 75 USA 3907
16 [1x36 char] 75 USA 3897
15 [1x36 char] 75 USA 3730
18 [1x36 char] 75 USA 3785
21 [1x36 char] 75 USA 3039
20 [1x36 char] 75 USA 3221
13 [1x36 char] 75 USA 3169
29 [1x36 char] 75 Japan 2171
23 [1x36 char] 75 USA 2639
20 [1x36 char] 75 USA 2914
23 [1x36 char] 75 USA 2592
24 [1x36 char] 75 Japan 2702
25 [1x36 char] 75 Germany 2223
24 [1x36 char] 75 Japan 2545
18 [1x36 char] 75 USA 2984
29 [1x36 char] 75 Germany 1937
19 [1x36 char] 75 USA 3211
23 [1x36 char] 75 Germany 2694
23 [1x36 char] 75 France 2957
22 [1x36 char] 75 Sweden 2945
25 [1x36 char] 75 Sweden 2671
33 [1x36 char] 75 Japan 1795
28 [1x36 char] 76 Italy 2464
25 [1x36 char] 76 Germany 2220
25 [1x36 char] 76 USA 2572
26 [1x36 char] 76 USA 2255
27 [1x36 char] 76 France 2202
17.5 [1x36 char] 76 USA 4215
16 [1x36 char] 76 USA 4190
15.5 [1x36 char] 76 USA 3962
14.5 [1x36 char] 76 USA 4215
22 [1x36 char] 76 USA 3233
22 [1x36 char] 76 USA 3353
24 [1x36 char] 76 USA 3012
22.5 [1x36 char] 76 USA 3085
29 [1x36 char] 76 USA 2035
24.5 [1x36 char] 76 USA 2164
29 [1x36 char] 76 Germany 1937
33 [1x36 char] 76 Japan 1795
20 [1x36 char] 76 USA 3651
18 [1x36 char] 76 USA 3574
18.5 [1x36 char] 76 USA 3645
17.5 [1x36 char] 76 USA 3193
29.5 [1x36 char] 76 Germany 1825
32 [1x36 char] 76 Japan 1990
28 [1x36 char] 76 Japan 2155
26.5 [1x36 char] 76 USA 2565
20 [1x36 char] 76 Sweden 3150
13 [1x36 char] 76 USA 3940
19 [1x36 char] 76 France 3270
19 [1x36 char] 76 Japan 2930
16.5 [1x36 char] 76 Germany 3820
16.5 [1x36 char] 76 USA 4380
13 [1x36 char] 76 USA 4055
13 [1x36 char] 76 USA 3870
13 [1x36 char] 76 USA 3755
31.5 [1x36 char] 77 Japan 2045
30 [1x36 char] 77 USA 2155
36 [1x36 char] 77 France 1825
25.5 [1x36 char] 77 USA 2300
33.5 [1x36 char] 77 Japan 1945
17.5 [1x36 char] 77 USA 3880
17 [1x36 char] 77 USA 4060
15.5 [1x36 char] 77 USA 4140
15 [1x36 char] 77 USA 4295
17.5 [1x36 char] 77 USA 3520
20.5 [1x36 char] 77 USA 3425
19 [1x36 char] 77 USA 3630
18.5 [1x36 char] 77 USA 3525
16 [1x36 char] 77 USA 4220
15.5 [1x36 char] 77 USA 4165
15.5 [1x36 char] 77 USA 4325
16 [1x36 char] 77 USA 4335
29 [1x36 char] 77 Germany 1940
24.5 [1x36 char] 77 USA 2740
26 [1x36 char] 77 Japan 2265
25.5 [1x36 char] 77 USA 2755
30.5 [1x36 char] 77 USA 2051
33.5 [1x36 char] 77 USA 2075
30 [1x36 char] 77 Japan 1985
30.5 [1x36 char] 77 Germany 2190
22 [1x36 char] 77 Japan 2815
21.5 [1x36 char] 77 Germany 2600
21.5 [1x36 char] 77 Japan 2720
43.1 [1x36 char] 78 Germany 1985
36.1 [1x36 char] 78 USA 1800
32.8 [1x36 char] 78 Japan 1985
39.4 [1x36 char] 78 Japan 2070
36.1 [1x36 char] 78 Japan 1800
19.9 [1x36 char] 78 USA 3365
19.4 [1x36 char] 78 USA 3735
20.2 [1x36 char] 78 USA 3570
19.2 [1x36 char] 78 USA 3535
20.5 [1x36 char] 78 USA 3155
20.2 [1x36 char] 78 USA 2965
25.1 [1x36 char] 78 USA 2720
20.5 [1x36 char] 78 USA 3430
19.4 [1x36 char] 78 USA 3210
20.6 [1x36 char] 78 USA 3380
20.8 [1x36 char] 78 USA 3070
18.6 [1x36 char] 78 USA 3620
18.1 [1x36 char] 78 USA 3410
19.2 [1x36 char] 78 USA 3425
17.7 [1x36 char] 78 USA 3445
18.1 [1x36 char] 78 USA 3205
17.5 [1x36 char] 78 USA 4080
30 [1x36 char] 78 USA 2155
27.5 [1x36 char] 78 Japan 2560
27.2 [1x36 char] 78 Japan 2300
30.9 [1x36 char] 78 USA 2230
21.1 [1x36 char] 78 Japan 2515
23.2 [1x36 char] 78 USA 2745
23.8 [1x36 char] 78 USA 2855
23.9 [1x36 char] 78 Japan 2405
20.3 [1x36 char] 78 Germany 2830
17 [1x36 char] 78 Sweden 3140
21.6 [1x36 char] 78 Sweden 2795
16.2 [1x36 char] 78 France 3410
31.5 [1x36 char] 78 Germany 1990
29.5 [1x36 char] 78 Japan 2135
21.5 [1x36 char] 79 USA 3245
19.8 [1x36 char] 79 USA 2990
22.3 [1x36 char] 79 USA 2890
20.2 [1x36 char] 79 USA 3265
20.6 [1x36 char] 79 USA 3360
17 [1x36 char] 79 USA 3840
17.6 [1x36 char] 79 USA 3725
16.5 [1x36 char] 79 USA 3955
18.2 [1x36 char] 79 USA 3830
16.9 [1x36 char] 79 USA 4360
15.5 [1x36 char] 79 USA 4054
19.2 [1x36 char] 79 USA 3605
18.5 [1x36 char] 79 USA 3940
31.9 [1x36 char] 79 Germany 1925
34.1 [1x36 char] 79 Japan 1975
35.7 [1x36 char] 79 USA 1915
27.4 [1x36 char] 79 USA 2670
25.4 [1x36 char] 79 Germany 3530
23 [1x36 char] 79 USA 3900
27.2 [1x36 char] 79 France 3190
23.9 [1x36 char] 79 USA 3420
34.2 [1x36 char] 79 USA 2200
34.5 [1x36 char] 79 USA 2150
31.8 [1x36 char] 79 Japan 2020
37.3 [1x36 char] 79 Italy 2130
28.4 [1x36 char] 79 USA 2670
28.8 [1x36 char] 79 USA 2595
26.8 [1x36 char] 79 USA 2700
33.5 [1x36 char] 79 USA 2556
41.5 [1x36 char] 80 Germany 2144
38.1 [1x36 char] 80 Japan 1968
32.1 [1x36 char] 80 USA 2120
37.2 [1x36 char] 80 Japan 2019
28 [1x36 char] 80 USA 2678
26.4 [1x36 char] 80 USA 2870
24.3 [1x36 char] 80 USA 3003
19.1 [1x36 char] 80 USA 3381
34.3 [1x36 char] 80 Germany 2188
29.8 [1x36 char] 80 Japan 2711
31.3 [1x36 char] 80 Japan 2542
37 [1x36 char] 80 Japan 2434
32.2 [1x36 char] 80 Japan 2265
46.6 [1x36 char] 80 Japan 2110
27.9 [1x36 char] 80 USA 2800
40.8 [1x36 char] 80 Japan 2110
44.3 [1x36 char] 80 Germany 2085
43.4 [1x36 char] 80 Germany 2335
36.4 [1x36 char] 80 Germany 2950
30 [1x36 char] 80 Germany 3250
44.6 [1x36 char] 80 Japan 1850
40.9 [1x36 char] 80 France 1835
33.8 [1x36 char] 80 Japan 2145
29.8 [1x36 char] 80 Germany 1845
32.7 [1x36 char] 80 Japan 2910
23.7 [1x36 char] 80 Japan 2420
35 [1x36 char] 80 England 2500
23.6 [1x36 char] 80 USA 2905
32.4 [1x36 char] 80 Japan 2290
27.2 [1x36 char] 81 USA 2490
26.6 [1x36 char] 81 USA 2635
25.8 [1x36 char] 81 USA 2620
23.5 [1x36 char] 81 USA 2725
30 [1x36 char] 81 USA 2385
39.1 [1x36 char] 81 Japan 1755
39 [1x36 char] 81 USA 1875
35.1 [1x36 char] 81 Japan 1760
32.3 [1x36 char] 81 Japan 2065
37 [1x36 char] 81 Japan 1975
37.7 [1x36 char] 81 Japan 2050
34.1 [1x36 char] 81 Japan 1985
34.7 [1x36 char] 81 USA 2215
34.4 [1x36 char] 81 USA 2045
29.9 [1x36 char] 81 USA 2380
33 [1x36 char] 81 Germany 2190
34.5 [1x36 char] 81 France 2320
33.7 [1x36 char] 81 Japan 2210
32.4 [1x36 char] 81 Japan 2350
32.9 [1x36 char] 81 Japan 2615
31.6 [1x36 char] 81 Japan 2635
28.1 [1x36 char] 81 France 3230
NaN [1x36 char] 81 Sweden 2800
30.7 [1x36 char] 81 Sweden 3160
25.4 [1x36 char] 81 Japan 2900
24.2 [1x36 char] 81 Japan 2930
22.4 [1x36 char] 81 USA 3415
26.6 [1x36 char] 81 USA 3725
20.2 [1x36 char] 81 USA 3060
17.6 [1x36 char] 81 USA 3465
28 [1x36 char] 82 USA 2605
27 [1x36 char] 82 USA 2640
34 [1x36 char] 82 USA 2395
31 [1x36 char] 82 USA 2575
29 [1x36 char] 82 USA 2525
27 [1x36 char] 82 USA 2735
24 [1x36 char] 82 USA 2865
23 [1x36 char] 82 USA 3035
36 [1x36 char] 82 Germany 1980
37 [1x36 char] 82 Japan 2025
31 [1x36 char] 82 Japan 1970
38 [1x36 char] 82 USA 2125
36 [1x36 char] 82 USA 2125
36 [1x36 char] 82 Japan 2160
36 [1x36 char] 82 Japan 2205
34 [1x36 char] 82 Japan 2245
38 [1x36 char] 82 Japan 1965
32 [1x36 char] 82 Japan 1965
38 [1x36 char] 82 Japan 1995
25 [1x36 char] 82 USA 2945
38 [1x36 char] 82 USA 3015
26 [1x36 char] 82 USA 2585
22 [1x36 char] 82 USA 2835
32 [1x36 char] 82 Japan 2665
36 [1x36 char] 82 USA 2370
27 [1x36 char] 82 USA 2950
27 [1x36 char] 82 USA 2790
44 [1x36 char] 82 Germany 2130
32 [1x36 char] 82 USA 2295
28 [1x36 char] 82 USA 2625
31 [1x36 char] 82 USA 2720
The summary function provides basic statistical information for each of the variables included in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaNs.
clc summary(cars);
Acceleration: [406x1 double]
min 1st Q median 3rd Q max
8 13.7000 15.5000 17.2000 24.8000
Cylinders: [406x1 double]
min 1st Q median 3rd Q max
3 4 4 8 8
Displacement: [406x1 double]
min 1st Q median 3rd Q max
68 105 151 302 455
Horsepower: [406x1 double]
Columns 1 through 5
min 1st Q median 3rd Q max
46 75.5000 95 130 230
Column 6
NaNs
6
MPG: [406x1 double]
Columns 1 through 5
min 1st Q median 3rd Q max
9 17.5000 23 29 46.6000
Column 6
NaNs
8
Model: [406x36 char]
Model_Year: [406x1 double]
min 1st Q median 3rd Q max
70 73 76 79 82
Origin: [406x7 char]
Weight: [406x1 double]
min 1st Q median 3rd Q max
1613 2226 2.8225e+003 3620 5140
We can index into a dataset object like a regular matrix.
clc cars(1:10, :)
ans =
Acceleration Cylinders Displacement Horsepower
12 8 307 130
11.5 8 350 165
11 8 318 150
12 8 304 150
10.5 8 302 140
10 8 429 198
9 8 454 220
8.5 8 440 215
10 8 455 225
8.5 8 390 190
MPG Model Model_Year Origin Weight
18 [1x36 char] 70 USA 3504
15 [1x36 char] 70 USA 3693
18 [1x36 char] 70 USA 3436
16 [1x36 char] 70 USA 3433
17 [1x36 char] 70 USA 3449
15 [1x36 char] 70 USA 4341
14 [1x36 char] 70 USA 4354
14 [1x36 char] 70 USA 4312
14 [1x36 char] 70 USA 4425
15 [1x36 char] 70 USA 3850
We can access individual columns by referencing them by their names...
clc
cars(1:10, {'Origin', 'MPG', 'Weight'})
ans =
Origin MPG Weight
USA 18 3504
USA 15 3693
USA 18 3436
USA 16 3433
USA 17 3449
USA 15 4341
USA 14 4354
USA 14 4312
USA 14 4425
USA 15 3850
The dot-notation allows you to extract the whole content of a variable.
clc cars.Horsepower(1:10)
ans = 130 165 150 150 140 198 220 215 225 190
Dataset objects store meta information.
get(cars)
Description: ''
VarDescription: {}
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {}
VarNames: {1x9 cell}
We can add dataset descriptions as well as units for the variables.
clc cars = set(cars, 'Description', 'Performance and structural information of automobiles'); cars = set(cars, 'Units', {'m/s^2', '', 'mm', 'hp', 'mpg', '', '', '', 'kg'}); summary(cars)
Performance and structural information of automobiles
Acceleration: [406x1 double, Units = m/s^2]
min 1st Q median 3rd Q max
8 13.7000 15.5000 17.2000 24.8000
Cylinders: [406x1 double]
min 1st Q median 3rd Q max
3 4 4 8 8
Displacement: [406x1 double, Units = mm]
min 1st Q median 3rd Q max
68 105 151 302 455
Horsepower: [406x1 double, Units = hp]
Columns 1 through 5
min 1st Q median 3rd Q max
46 75.5000 95 130 230
Column 6
NaNs
6
MPG: [406x1 double, Units = mpg]
Columns 1 through 5
min 1st Q median 3rd Q max
9 17.5000 23 29 46.6000
Column 6
NaNs
8
Model: [406x36 char]
Model_Year: [406x1 double]
min 1st Q median 3rd Q max
70 73 76 79 82
Origin: [406x7 char]
Weight: [406x1 double, Units = kg]
min 1st Q median 3rd Q max
1613 2226 2.8225e+003 3620 5140
Categorical Arrays
Notice that some of the variables take on discrete values. For instance, the Cylinders, and Origin take on a unique set of values:
clc disp('Cylinders:'); unique(cars(:, 'Cylinders')) disp('Origin:'); unique(cars(:, 'Origin'))
Cylinders:
ans =
Cylinders
3
4
5
6
8
Origin:
ans =
Origin
England
France
Germany
Italy
Japan
Sweden
USA
Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.
clc Cylinders_cat = ordinal(Cylinders); Origin_cat = nominal(Origin); whos Cylinders* Origin*
Name Size Bytes Class Attributes Cylinders 406x1 3248 double Cylinders_cat 406x1 1178 ordinal Origin 406x7 5684 char Origin_cat 406x1 1366 nominal
Now, let's convert the variables of the dataset object.
cars.Cylinders = ordinal(cars.Cylinders); cars.Origin = nominal(cars.Origin);
Filtering
Dataset objects can be easily filtered by criteria.
For example, we can create a logical array that has ONEs where the origin is Germany and ZEROs where it's not Germany.
germanyMask = cars.Origin == 'Germany'
germanyMask =
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
1
1
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
Use mask to extract out all German cars.
clc cars(germanyMask, :)
ans =
Acceleration Cylinders Displacement Horsepower
20.5 4 97 46
14.5 4 107 90
12.5 4 121 113
20 4 97 48
14 4 116 90
19 4 97 60
23.5 4 97 54
18 4 121 76
21 4 97 46
15.5 4 116 75
14 4 114 91
16.5 4 98 83
15.5 4 79 67
14.5 4 97 78
16.5 4 90 71
14 4 90 70
15 4 115 95
16.9 4 116 81
14.2 4 90 70
12.2 4 97 71
16.7 6 168 120
14.5 4 97 78
14.1 4 97 78
12.8 4 121 110
21.5 4 90 48
15.9 5 131 103
14.9 4 89 71
14 4 89 71
20.1 5 183 77
14.7 4 98 76
15.8 4 97 78
21.7 4 90 48
23.7 4 90 48
19.9 5 121 67
21.8 4 146 67
15.3 4 89 62
14.2 4 105 74
15.3 4 105 74
24.6 4 97 52
MPG Model Model_Year Origin Weight
26 [1x36 char] 70 Germany 1835
24 [1x36 char] 70 Germany 2430
26 [1x36 char] 70 Germany 2234
NaN [1x36 char] 71 Germany 1978
28 [1x36 char] 71 Germany 2123
27 [1x36 char] 71 Germany 1834
23 [1x36 char] 72 Germany 2254
22 [1x36 char] 72 Germany 2511
26 [1x36 char] 73 Germany 1950
24 [1x36 char] 73 Germany 2158
20 [1x36 char] 73 Germany 2582
29 [1x36 char] 74 Germany 2219
26 [1x36 char] 74 Germany 1963
26 [1x36 char] 74 Germany 2300
25 [1x36 char] 75 Germany 2223
29 [1x36 char] 75 Germany 1937
23 [1x36 char] 75 Germany 2694
25 [1x36 char] 76 Germany 2220
29 [1x36 char] 76 Germany 1937
29.5 [1x36 char] 76 Germany 1825
16.5 [1x36 char] 76 Germany 3820
29 [1x36 char] 77 Germany 1940
30.5 [1x36 char] 77 Germany 2190
21.5 [1x36 char] 77 Germany 2600
43.1 [1x36 char] 78 Germany 1985
20.3 [1x36 char] 78 Germany 2830
31.5 [1x36 char] 78 Germany 1990
31.9 [1x36 char] 79 Germany 1925
25.4 [1x36 char] 79 Germany 3530
41.5 [1x36 char] 80 Germany 2144
34.3 [1x36 char] 80 Germany 2188
44.3 [1x36 char] 80 Germany 2085
43.4 [1x36 char] 80 Germany 2335
36.4 [1x36 char] 80 Germany 2950
30 [1x36 char] 80 Germany 3250
29.8 [1x36 char] 80 Germany 1845
33 [1x36 char] 81 Germany 2190
36 [1x36 char] 82 Germany 1980
44 [1x36 char] 82 Germany 2130
Scatter plot grouped by the year of the make.
gscatter(cars.MPG, cars.Weight, cars.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight')
We notice a general trend, but the amount of data prevents us from getting useful information.
We can use filtering to refine the visualization. Let's extract out only the cars made in 1970, 1976, or 1982.
index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82; filtered = cars(index,:);
We have a more meaningful scatter plot for this smaller subset.
gscatter(filtered.MPG, filtered.Weight, filtered.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight')
Add interactive case names to the plot
gname(filtered.Model);
Concatenate and Join
We can combine datasets by either concatenating or joining.
Concatenate
We have a different set of data that corresponds to small cars. Let's combine this with the original dataset. First, we'll create a dataset object from this data.
% load carsmall cs = load('carsmall.mat'); % create dataset and convert variables to categorical arrays cars_s = dataset(cs); cars_s.Origin = nominal(cars_s.Origin); cars_s.Cylinders = ordinal(cars_s.Cylinders); % add additional levels match the levels from the carbig dataset cars_s.Cylinders = addlevels(cars_s.Cylinders, getlabels(cars.Cylinders)); cars_s.Cylinders = reorderlevels(cars_s.Cylinders, getlabels(cars.Cylinders));
Warning: Ignoring duplicate levels in NEWLEVELS.
Concatenate using the matrix concatenation notation.
cars_all = [cars; cars_s]; % alternatively, % cars_all = vertcat(cars, cars_s);
Join
Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.
clc tabulate(cars_all.Origin);
Value Count Percent
England 1 0.20%
France 18 3.56%
Germany 48 9.49%
Italy 9 1.78%
Japan 94 18.58%
Sweden 13 2.57%
USA 323 63.83%
Create a new dataset that maps countries to continents.
clc Newdata = dataset(... {nominal({'England';'France';'Germany';'Italy' ;'Japan';'Sweden';'USA' }),'Origin' }, ... {nominal({'Europe' ;'Europe';'Europe' ;'Europe';'Asia' ;'Europe';'North America'}),'Continent'})
Newdata =
Origin Continent
England Europe
France Europe
Germany Europe
Italy Europe
Japan Asia
Sweden Europe
USA North America
Join the two datasets to include Continent as a new variable.
cars_all = join(cars_all, Newdata); clc cars_all(1:10:100, :)
ans =
Acceleration Cylinders Displacement Horsepower
12 8 307 130
17.5 4 133 115
15 4 113 95
15 6 199 90
13 6 232 100
12 8 400 170
19 4 71 65
12 8 400 175
14 8 307 130
15 4 98 80
MPG Model Model_Year Origin Weight
18 [1x36 char] 70 USA 3504
NaN [1x36 char] 70 France 3090
24 [1x36 char] 70 Japan 2372
21 [1x36 char] 70 USA 2648
19 [1x36 char] 71 USA 2634
13 [1x36 char] 71 USA 4746
31 [1x36 char] 71 Japan 1773
14 [1x36 char] 72 USA 4385
13 [1x36 char] 72 USA 4098
28 [1x36 char] 72 USA 2164
Continent
North America
Europe
Asia
North America
North America
North America
Asia
North America
North America
North America
Dealing with Missing Data
Notice that we have some missing data in our MPG data
clc
cars(5:20, 'MPG')
ans =
MPG
17
15
14
14
14
15
NaN
NaN
NaN
NaN
NaN
15
14
NaN
15
14
One way to deal with missing data is to substitute for the missing value. In this case, we will create a regression model to represent the performance measures (MPG) as functions of possible predictor variables (acceleration, cylinders, horsepower, model year, and weight)
X = [ones(length(cars.MPG),1) cars.Acceleration, double(cars.Cylinders), ...
cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight];
Y = [cars.MPG];
[b,bint,r,rint,stats] = regress(Y, X);
Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion on the independent variables
cars.regress = X * b;
fprintf('R-squared: %f\n', stats(1));
R-squared: 0.814178
Examine the residual.
residuals = cars.MPG - cars.regress; stem(cars.regress, residuals); xlabel('model'); ylabel('actual - model');
For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.
gname(cars.Model)
Clean Up Our Model
We can potentially improve the model by adding dummy variables to handle diesels, automatic transmissions, and station wagons. In addition, we can filter out 3 and 5 cylinder engines, which are rotary engines.
Dummy variable is a binary variable that has a "1" where it satisfies the criteria and "0" everywhere else.
% Load in as a dataset object ds = dataset('file','dummy.txt'); % Concatenate carsall = [cars, ds]; carsall = set(carsall, 'Units', [get(cars, 'Units'), {'', '', '', ''}]); % Filter out 3- and 5- cylinder engines index = carsall.Cylinders == '4' | carsall.Cylinders == '6' | carsall.Cylinders == '8'; carsall = carsall(index,:);
Create a New Regression Model
Create a new regression model just by looking at 4, 6, and 8 cylinders and taking into account the car type (station wagon, diesel, automatic).
X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model)
Multivariate Analysis of Variance
Multivariate analysis of variance to see how similar the cars from various countries are, in terms of MPG, Acceleration, Weight, and Displacement.
X = [carsall.MPG, carsall.Acceleration, carsall.Weight, carsall.Displacement]; [d, p, stats] = manova1(X, carsall.Origin); manovacluster(stats)
We see that Japanese and German cars are quite similar, and they are very different from English and American cars
Let's add another dummy variable that distinguished Japanese and German cars. Then redo the regression.
carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan'); X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic carsall.dummy]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; % Inspect once again residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model)
Robust Regression
We can also perform robust regression to deal with the outliers that may exist in the dataset.
X2 = [carsall.Acceleration, double(carsall.Cylinders), ... carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ... carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy]; [robustbeta, stats] = robustfit(X2, Y) X3 = [ones(length(carsall.MPG),1), X2]; carsall.regress2 = X3 * robustbeta;
robustbeta =
-4.1872
-0.2086
-1.6006
0.0126
-0.0166
0.6444
-0.0048
0.1167
12.4719
-2.6949
1.7508
stats =
ols_s: 2.9910
robust_s: 2.8012
mad_s: 2.6344
s: 2.8477
resid: [399x1 double]
rstud: [399x1 double]
se: [11x1 double]
covb: [11x11 double]
coeffcorr: [11x11 double]
t: [11x1 double]
p: [11x1 double]
w: [399x1 double]
R: [11x11 double]
dfe: 374
h: [399x1 double]
Perform Regression Substitution
We have been looking at linear regressions so far, but we might be able to apply some nonlinear regressions to get a better predictive model.
Once we have a regression model, we can go ahead substitute the missing values with the model data.
carsall.mask = isnan(carsall.MPG);
carsall.MPG(carsall.mask) = carsall.regress2(carsall.mask);
carsall(5:20, 'MPG')
ans =
MPG
17
15
14
14
14
15
18.881
12.256
13.095
12.597
13.732
15
14
16.498
15
14
