MATLAB Examples

# Solving Data Management and Analysis Challenges using MATLAB® and Statistics Toolbox™

Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).

In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.

## Automobile Data

Now let's begin. We'll work with this MAT file which contains some automobile data.

```clear; load carbig whos ```
``` Name Size Bytes Class Attributes Acceleration 406x1 3248 double Cylinders 406x1 3248 double Displacement 406x1 3248 double Horsepower 406x1 3248 double MPG 406x1 3248 double Model 406x36 29232 char Model_Year 406x1 3248 double Origin 406x7 5684 char Weight 406x1 3248 double cyl4 406x5 4060 char org 406x7 5684 char when 406x5 4060 char ```

This data set contains information regarding 406 different cars. There are different variables for each piece of information, and each row corresponds to the same car.

## Dataset Object

Dataset objects allow you to organize information in a tabular format, and have structures very much like that of matrices. Each row represents the observations, or "cars" in this case, and the columns represent the variables, with the appropriate header names.

```clc; cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ... MPG, Model, Model_Year, Origin, Weight) ```
```cars = Acceleration Cylinders Displacement Horsepower 12 8 307 130 11.5 8 350 165 11 8 318 150 12 8 304 150 10.5 8 302 140 10 8 429 198 9 8 454 220 8.5 8 440 215 10 8 455 225 8.5 8 390 190 17.5 4 133 115 11.5 8 350 165 11 8 351 153 10.5 8 383 175 11 8 360 175 10 8 383 170 8 8 340 160 8 8 302 140 9.5 8 400 150 10 8 455 225 15 4 113 95 15.5 6 198 95 15.5 6 199 97 16 6 200 85 14.5 4 97 88 20.5 4 97 46 17.5 4 110 87 14.5 4 107 90 17.5 4 104 95 12.5 4 121 113 15 6 199 90 14 8 360 215 15 8 307 200 13.5 8 318 210 18.5 8 304 193 14.5 4 97 88 15.5 4 140 90 14 4 113 95 19 4 98 NaN 20 4 97 48 13 6 232 100 15.5 6 225 105 15.5 6 250 100 15.5 6 250 88 15.5 6 232 100 12 8 350 165 11.5 8 400 175 13.5 8 351 153 13 8 318 150 11.5 8 383 180 12 8 400 170 12 8 400 175 13.5 6 258 110 19 4 140 72 15 6 250 100 14.5 6 250 88 14 4 122 86 14 4 116 90 19.5 4 79 70 14.5 4 88 76 19 4 71 65 18 4 72 69 19 4 97 60 20.5 4 91 70 15.5 4 113 95 17 4 97.5 80 23.5 4 97 54 19.5 4 140 90 16.5 4 122 86 12 8 350 165 12 8 400 175 13.5 8 318 150 13 8 351 153 11.5 8 304 150 11 8 429 208 13.5 8 350 155 13.5 8 350 160 12.5 8 400 190 13.5 3 70 97 12.5 8 304 150 14 8 307 130 16 8 302 140 14 8 318 150 14.5 4 121 112 18 4 121 76 19.5 4 120 87 18 4 96 69 16 4 122 86 17 4 97 92 14.5 4 120 97 15 4 98 80 16.5 4 97 88 13 8 350 175 11.5 8 304 150 13 8 350 145 14.5 8 302 137 12.5 8 318 150 11.5 8 429 198 12 8 400 150 13 8 351 158 14.5 8 318 150 11 8 440 215 11 8 455 225 11 8 360 175 16.5 6 225 105 18 6 250 100 16 6 232 100 16.5 6 250 88 16 6 198 95 21 4 97 46 14 8 400 150 12.5 8 400 167 13 8 360 170 12.5 8 350 180 15 6 232 100 19 4 97 88 19.5 4 140 72 16.5 4 108 94 13.5 3 70 90 18.5 4 122 85 14 6 155 107 15.5 4 98 90 13 8 350 145 9.5 8 400 230 19.5 4 68 49 15.5 4 116 75 14 4 114 91 15.5 4 121 112 11 8 318 150 14 4 121 110 13.5 6 156 122 11 8 350 180 16.5 6 198 95 17 6 200 NaN 16 6 232 100 17 6 250 100 19 4 79 67 16.5 4 122 80 21 4 71 65 17 4 140 75 17 6 250 100 18 6 258 110 16.5 6 225 105 14 8 302 140 14.5 8 350 150 13.5 8 318 150 16 8 302 140 15.5 8 304 150 16.5 4 98 83 15.5 4 79 67 14.5 4 97 78 16.5 4 76 52 19 4 83 61 14.5 4 90 75 15.5 4 90 75 14 4 116 75 15 4 120 97 15.5 4 108 93 16 4 79 67 16 6 225 95 16 6 250 105 21 6 250 72 19.5 6 250 72 11.5 8 400 170 14 8 350 145 14.5 8 318 150 13.5 8 351 148 21 6 231 110 18.5 6 250 105 19 6 258 110 19 6 225 95 15 6 231 110 13.5 8 262 110 12 8 302 129 16 4 97 75 17 4 140 83 16 6 232 100 18.5 4 140 78 13.5 4 134 96 16.5 4 90 71 17 4 119 97 14.5 6 171 97 14 4 90 70 17 6 232 90 15 4 115 95 17 4 120 88 14.5 4 121 98 13.5 4 121 115 17.5 4 91 53 15.5 4 107 86 16.9 4 116 81 14.9 4 140 92 17.7 4 98 79 15.3 4 101 83 13 8 305 140 13 8 318 150 13.9 8 304 120 12.8 8 351 152 15.4 6 225 100 14.5 6 250 105 17.6 6 200 81 17.6 6 232 90 22.2 4 85 52 22.1 4 98 60 14.2 4 90 70 17.4 4 91 53 17.7 6 225 100 21 6 250 78 16.2 6 250 110 17.8 6 258 95 12.2 4 97 71 17 4 85 70 16.4 4 97 75 13.6 4 140 72 15.7 4 130 102 13.2 8 318 150 21.9 4 120 88 15.5 6 156 108 16.7 6 168 120 12.1 8 350 180 12 8 350 145 15 8 302 130 14 8 318 150 18.5 4 98 68 14.8 4 111 80 18.6 4 79 58 15.5 4 122 96 16.8 4 85 70 12.5 8 305 145 19 8 260 110 13.7 8 318 145 14.9 8 302 130 16.4 6 250 110 16.9 6 231 105 17.7 6 225 100 19 6 250 98 11.1 8 400 180 11.4 8 350 170 12.2 8 400 190 14.5 8 351 149 14.5 4 97 78 16 4 151 88 18.2 4 97 75 15.8 4 140 89 17 4 98 63 15.9 4 98 83 16.4 4 97 67 14.1 4 97 78 14.5 6 146 97 12.8 4 121 110 13.5 3 80 110 21.5 4 90 48 14.4 4 98 66 19.4 4 78 52 18.6 4 85 70 16.4 4 91 60 15.5 8 260 110 13.2 8 318 140 12.8 8 302 139 19.2 6 231 105 18.2 6 200 95 15.8 6 200 85 15.4 4 140 88 17.2 6 225 100 17.2 6 232 90 15.8 6 231 105 16.7 6 200 85 18.7 6 225 110 15.1 6 258 120 13.2 8 305 145 13.4 6 231 165 11.2 8 302 139 13.7 8 318 140 16.5 4 98 68 14.2 4 134 95 14.7 4 119 97 14.5 4 105 75 14.8 4 134 95 16.7 4 156 105 17.6 4 151 85 14.9 4 119 97 15.9 5 131 103 13.6 6 163 125 15.7 4 121 115 15.8 6 163 133 14.9 4 89 71 16.6 4 98 68 15.4 6 231 115 18.2 6 200 85 17.3 4 140 88 18.2 6 232 90 16.6 6 225 110 15.4 8 305 130 13.4 8 302 129 13.2 8 351 138 15.2 8 318 135 14.9 8 350 155 14.3 8 351 142 15 8 267 125 13 8 360 150 14 4 89 71 15.2 4 86 65 14.4 4 98 80 15 4 121 80 20.1 5 183 77 17.4 8 350 125 24.8 4 141 71 22.2 8 260 90 13.2 4 105 70 14.9 4 105 70 19.2 4 85 65 14.7 4 91 69 16 4 151 90 11.3 6 173 115 12.9 6 173 115 13.2 4 151 90 14.7 4 98 76 18.8 4 89 60 15.5 4 98 70 16.4 4 86 65 16.5 4 151 90 18.1 4 140 88 20.1 4 151 90 18.7 6 225 90 15.8 4 97 78 15.5 4 134 90 17.5 4 120 75 15 4 119 92 15.2 4 108 75 17.9 4 86 65 14.4 4 156 105 19.2 4 85 65 21.7 4 90 48 23.7 4 90 48 19.9 5 121 67 21.8 4 146 67 13.8 4 91 67 17.3 4 85 NaN 18 4 97 67 15.3 4 89 62 11.4 6 168 132 12.5 3 70 100 15.1 4 122 88 14.3 4 140 NaN 17 4 107 72 15.7 4 135 84 16.4 4 151 84 14.4 4 156 92 12.6 6 173 110 12.9 4 135 84 16.9 4 79 58 16.4 4 86 64 16.1 4 81 60 17.8 4 97 67 19.4 4 85 65 17.3 4 89 62 16 4 91 68 14.9 4 105 63 16.2 4 98 65 20.7 4 98 65 14.2 4 105 74 15.8 4 100 NaN 14.4 4 107 75 16.8 4 108 75 14.8 4 119 100 18.3 4 120 74 20.4 4 141 80 15.4 4 121 110 19.6 6 145 76 12.6 6 168 116 13.8 6 146 120 15.8 6 231 110 19 8 350 105 17.1 6 200 88 16.6 6 225 85 19.6 4 112 88 18.6 4 112 88 18 4 112 88 16.2 4 112 85 16 4 135 84 18 4 151 90 16.4 4 140 92 20.5 4 151 NaN 15.3 4 105 74 18.2 4 91 68 17.6 4 91 68 14.7 4 105 63 17.3 4 98 70 14.5 4 120 88 14.5 4 107 75 16.9 4 108 70 15 4 91 67 15.7 4 91 67 16.2 4 91 67 16.4 6 181 110 17 6 262 85 14.5 4 156 92 14.7 6 232 112 13.9 4 144 96 13 4 135 84 17.3 4 151 90 15.6 4 140 86 24.6 4 97 52 11.6 4 135 84 18.6 4 120 79 19.4 4 119 82 MPG Model Model_Year Origin Weight 18 [1x36 char] 70 USA 3504 15 [1x36 char] 70 USA 3693 18 [1x36 char] 70 USA 3436 16 [1x36 char] 70 USA 3433 17 [1x36 char] 70 USA 3449 15 [1x36 char] 70 USA 4341 14 [1x36 char] 70 USA 4354 14 [1x36 char] 70 USA 4312 14 [1x36 char] 70 USA 4425 15 [1x36 char] 70 USA 3850 NaN [1x36 char] 70 France 3090 NaN [1x36 char] 70 USA 4142 NaN [1x36 char] 70 USA 4034 NaN [1x36 char] 70 USA 4166 NaN [1x36 char] 70 USA 3850 15 [1x36 char] 70 USA 3563 14 [1x36 char] 70 USA 3609 NaN [1x36 char] 70 USA 3353 15 [1x36 char] 70 USA 3761 14 [1x36 char] 70 USA 3086 24 [1x36 char] 70 Japan 2372 22 [1x36 char] 70 USA 2833 18 [1x36 char] 70 USA 2774 21 [1x36 char] 70 USA 2587 27 [1x36 char] 70 Japan 2130 26 [1x36 char] 70 Germany 1835 25 [1x36 char] 70 France 2672 24 [1x36 char] 70 Germany 2430 25 [1x36 char] 70 Sweden 2375 26 [1x36 char] 70 Germany 2234 21 [1x36 char] 70 USA 2648 10 [1x36 char] 70 USA 4615 10 [1x36 char] 70 USA 4376 11 [1x36 char] 70 USA 4382 9 [1x36 char] 70 USA 4732 27 [1x36 char] 71 Japan 2130 28 [1x36 char] 71 USA 2264 25 [1x36 char] 71 Japan 2228 25 [1x36 char] 71 USA 2046 NaN [1x36 char] 71 Germany 1978 19 [1x36 char] 71 USA 2634 16 [1x36 char] 71 USA 3439 17 [1x36 char] 71 USA 3329 19 [1x36 char] 71 USA 3302 18 [1x36 char] 71 USA 3288 14 [1x36 char] 71 USA 4209 14 [1x36 char] 71 USA 4464 14 [1x36 char] 71 USA 4154 14 [1x36 char] 71 USA 4096 12 [1x36 char] 71 USA 4955 13 [1x36 char] 71 USA 4746 13 [1x36 char] 71 USA 5140 18 [1x36 char] 71 USA 2962 22 [1x36 char] 71 USA 2408 19 [1x36 char] 71 USA 3282 18 [1x36 char] 71 USA 3139 23 [1x36 char] 71 USA 2220 28 [1x36 char] 71 Germany 2123 30 [1x36 char] 71 France 2074 30 [1x36 char] 71 Italy 2065 31 [1x36 char] 71 Japan 1773 35 [1x36 char] 71 Japan 1613 27 [1x36 char] 71 Germany 1834 26 [1x36 char] 71 USA 1955 24 [1x36 char] 72 Japan 2278 25 [1x36 char] 72 USA 2126 23 [1x36 char] 72 Germany 2254 20 [1x36 char] 72 USA 2408 21 [1x36 char] 72 USA 2226 13 [1x36 char] 72 USA 4274 14 [1x36 char] 72 USA 4385 15 [1x36 char] 72 USA 4135 14 [1x36 char] 72 USA 4129 17 [1x36 char] 72 USA 3672 11 [1x36 char] 72 USA 4633 13 [1x36 char] 72 USA 4502 12 [1x36 char] 72 USA 4456 13 [1x36 char] 72 USA 4422 19 [1x36 char] 72 Japan 2330 15 [1x36 char] 72 USA 3892 13 [1x36 char] 72 USA 4098 13 [1x36 char] 72 USA 4294 14 [1x36 char] 72 USA 4077 18 [1x36 char] 72 Sweden 2933 22 [1x36 char] 72 Germany 2511 21 [1x36 char] 72 France 2979 26 [1x36 char] 72 France 2189 22 [1x36 char] 72 USA 2395 28 [1x36 char] 72 Japan 2288 23 [1x36 char] 72 Japan 2506 28 [1x36 char] 72 USA 2164 27 [1x36 char] 72 Japan 2100 13 [1x36 char] 73 USA 4100 14 [1x36 char] 73 USA 3672 13 [1x36 char] 73 USA 3988 14 [1x36 char] 73 USA 4042 15 [1x36 char] 73 USA 3777 12 [1x36 char] 73 USA 4952 13 [1x36 char] 73 USA 4464 13 [1x36 char] 73 USA 4363 14 [1x36 char] 73 USA 4237 13 [1x36 char] 73 USA 4735 12 [1x36 char] 73 USA 4951 13 [1x36 char] 73 USA 3821 18 [1x36 char] 73 USA 3121 16 [1x36 char] 73 USA 3278 18 [1x36 char] 73 USA 2945 18 [1x36 char] 73 USA 3021 23 [1x36 char] 73 USA 2904 26 [1x36 char] 73 Germany 1950 11 [1x36 char] 73 USA 4997 12 [1x36 char] 73 USA 4906 13 [1x36 char] 73 USA 4654 12 [1x36 char] 73 USA 4499 18 [1x36 char] 73 USA 2789 20 [1x36 char] 73 Japan 2279 21 [1x36 char] 73 USA 2401 22 [1x36 char] 73 Japan 2379 18 [1x36 char] 73 Japan 2124 19 [1x36 char] 73 USA 2310 21 [1x36 char] 73 USA 2472 26 [1x36 char] 73 Italy 2265 15 [1x36 char] 73 USA 4082 16 [1x36 char] 73 USA 4278 29 [1x36 char] 73 Italy 1867 24 [1x36 char] 73 Germany 2158 20 [1x36 char] 73 Germany 2582 19 [1x36 char] 73 Sweden 2868 15 [1x36 char] 73 USA 3399 24 [1x36 char] 73 Sweden 2660 20 [1x36 char] 73 Japan 2807 11 [1x36 char] 73 USA 3664 20 [1x36 char] 74 USA 3102 21 [1x36 char] 74 USA 2875 19 [1x36 char] 74 USA 2901 15 [1x36 char] 74 USA 3336 31 [1x36 char] 74 Japan 1950 26 [1x36 char] 74 USA 2451 32 [1x36 char] 74 Japan 1836 25 [1x36 char] 74 USA 2542 16 [1x36 char] 74 USA 3781 16 [1x36 char] 74 USA 3632 18 [1x36 char] 74 USA 3613 16 [1x36 char] 74 USA 4141 13 [1x36 char] 74 USA 4699 14 [1x36 char] 74 USA 4457 14 [1x36 char] 74 USA 4638 14 [1x36 char] 74 USA 4257 29 [1x36 char] 74 Germany 2219 26 [1x36 char] 74 Germany 1963 26 [1x36 char] 74 Germany 2300 31 [1x36 char] 74 Japan 1649 32 [1x36 char] 74 Japan 2003 28 [1x36 char] 74 USA 2125 24 [1x36 char] 74 Italy 2108 26 [1x36 char] 74 Italy 2246 24 [1x36 char] 74 Japan 2489 26 [1x36 char] 74 Japan 2391 31 [1x36 char] 74 Italy 2000 19 [1x36 char] 75 USA 3264 18 [1x36 char] 75 USA 3459 15 [1x36 char] 75 USA 3432 15 [1x36 char] 75 USA 3158 16 [1x36 char] 75 USA 4668 15 [1x36 char] 75 USA 4440 16 [1x36 char] 75 USA 4498 14 [1x36 char] 75 USA 4657 17 [1x36 char] 75 USA 3907 16 [1x36 char] 75 USA 3897 15 [1x36 char] 75 USA 3730 18 [1x36 char] 75 USA 3785 21 [1x36 char] 75 USA 3039 20 [1x36 char] 75 USA 3221 13 [1x36 char] 75 USA 3169 29 [1x36 char] 75 Japan 2171 23 [1x36 char] 75 USA 2639 20 [1x36 char] 75 USA 2914 23 [1x36 char] 75 USA 2592 24 [1x36 char] 75 Japan 2702 25 [1x36 char] 75 Germany 2223 24 [1x36 char] 75 Japan 2545 18 [1x36 char] 75 USA 2984 29 [1x36 char] 75 Germany 1937 19 [1x36 char] 75 USA 3211 23 [1x36 char] 75 Germany 2694 23 [1x36 char] 75 France 2957 22 [1x36 char] 75 Sweden 2945 25 [1x36 char] 75 Sweden 2671 33 [1x36 char] 75 Japan 1795 28 [1x36 char] 76 Italy 2464 25 [1x36 char] 76 Germany 2220 25 [1x36 char] 76 USA 2572 26 [1x36 char] 76 USA 2255 27 [1x36 char] 76 France 2202 17.5 [1x36 char] 76 USA 4215 16 [1x36 char] 76 USA 4190 15.5 [1x36 char] 76 USA 3962 14.5 [1x36 char] 76 USA 4215 22 [1x36 char] 76 USA 3233 22 [1x36 char] 76 USA 3353 24 [1x36 char] 76 USA 3012 22.5 [1x36 char] 76 USA 3085 29 [1x36 char] 76 USA 2035 24.5 [1x36 char] 76 USA 2164 29 [1x36 char] 76 Germany 1937 33 [1x36 char] 76 Japan 1795 20 [1x36 char] 76 USA 3651 18 [1x36 char] 76 USA 3574 18.5 [1x36 char] 76 USA 3645 17.5 [1x36 char] 76 USA 3193 29.5 [1x36 char] 76 Germany 1825 32 [1x36 char] 76 Japan 1990 28 [1x36 char] 76 Japan 2155 26.5 [1x36 char] 76 USA 2565 20 [1x36 char] 76 Sweden 3150 13 [1x36 char] 76 USA 3940 19 [1x36 char] 76 France 3270 19 [1x36 char] 76 Japan 2930 16.5 [1x36 char] 76 Germany 3820 16.5 [1x36 char] 76 USA 4380 13 [1x36 char] 76 USA 4055 13 [1x36 char] 76 USA 3870 13 [1x36 char] 76 USA 3755 31.5 [1x36 char] 77 Japan 2045 30 [1x36 char] 77 USA 2155 36 [1x36 char] 77 France 1825 25.5 [1x36 char] 77 USA 2300 33.5 [1x36 char] 77 Japan 1945 17.5 [1x36 char] 77 USA 3880 17 [1x36 char] 77 USA 4060 15.5 [1x36 char] 77 USA 4140 15 [1x36 char] 77 USA 4295 17.5 [1x36 char] 77 USA 3520 20.5 [1x36 char] 77 USA 3425 19 [1x36 char] 77 USA 3630 18.5 [1x36 char] 77 USA 3525 16 [1x36 char] 77 USA 4220 15.5 [1x36 char] 77 USA 4165 15.5 [1x36 char] 77 USA 4325 16 [1x36 char] 77 USA 4335 29 [1x36 char] 77 Germany 1940 24.5 [1x36 char] 77 USA 2740 26 [1x36 char] 77 Japan 2265 25.5 [1x36 char] 77 USA 2755 30.5 [1x36 char] 77 USA 2051 33.5 [1x36 char] 77 USA 2075 30 [1x36 char] 77 Japan 1985 30.5 [1x36 char] 77 Germany 2190 22 [1x36 char] 77 Japan 2815 21.5 [1x36 char] 77 Germany 2600 21.5 [1x36 char] 77 Japan 2720 43.1 [1x36 char] 78 Germany 1985 36.1 [1x36 char] 78 USA 1800 32.8 [1x36 char] 78 Japan 1985 39.4 [1x36 char] 78 Japan 2070 36.1 [1x36 char] 78 Japan 1800 19.9 [1x36 char] 78 USA 3365 19.4 [1x36 char] 78 USA 3735 20.2 [1x36 char] 78 USA 3570 19.2 [1x36 char] 78 USA 3535 20.5 [1x36 char] 78 USA 3155 20.2 [1x36 char] 78 USA 2965 25.1 [1x36 char] 78 USA 2720 20.5 [1x36 char] 78 USA 3430 19.4 [1x36 char] 78 USA 3210 20.6 [1x36 char] 78 USA 3380 20.8 [1x36 char] 78 USA 3070 18.6 [1x36 char] 78 USA 3620 18.1 [1x36 char] 78 USA 3410 19.2 [1x36 char] 78 USA 3425 17.7 [1x36 char] 78 USA 3445 18.1 [1x36 char] 78 USA 3205 17.5 [1x36 char] 78 USA 4080 30 [1x36 char] 78 USA 2155 27.5 [1x36 char] 78 Japan 2560 27.2 [1x36 char] 78 Japan 2300 30.9 [1x36 char] 78 USA 2230 21.1 [1x36 char] 78 Japan 2515 23.2 [1x36 char] 78 USA 2745 23.8 [1x36 char] 78 USA 2855 23.9 [1x36 char] 78 Japan 2405 20.3 [1x36 char] 78 Germany 2830 17 [1x36 char] 78 Sweden 3140 21.6 [1x36 char] 78 Sweden 2795 16.2 [1x36 char] 78 France 3410 31.5 [1x36 char] 78 Germany 1990 29.5 [1x36 char] 78 Japan 2135 21.5 [1x36 char] 79 USA 3245 19.8 [1x36 char] 79 USA 2990 22.3 [1x36 char] 79 USA 2890 20.2 [1x36 char] 79 USA 3265 20.6 [1x36 char] 79 USA 3360 17 [1x36 char] 79 USA 3840 17.6 [1x36 char] 79 USA 3725 16.5 [1x36 char] 79 USA 3955 18.2 [1x36 char] 79 USA 3830 16.9 [1x36 char] 79 USA 4360 15.5 [1x36 char] 79 USA 4054 19.2 [1x36 char] 79 USA 3605 18.5 [1x36 char] 79 USA 3940 31.9 [1x36 char] 79 Germany 1925 34.1 [1x36 char] 79 Japan 1975 35.7 [1x36 char] 79 USA 1915 27.4 [1x36 char] 79 USA 2670 25.4 [1x36 char] 79 Germany 3530 23 [1x36 char] 79 USA 3900 27.2 [1x36 char] 79 France 3190 23.9 [1x36 char] 79 USA 3420 34.2 [1x36 char] 79 USA 2200 34.5 [1x36 char] 79 USA 2150 31.8 [1x36 char] 79 Japan 2020 37.3 [1x36 char] 79 Italy 2130 28.4 [1x36 char] 79 USA 2670 28.8 [1x36 char] 79 USA 2595 26.8 [1x36 char] 79 USA 2700 33.5 [1x36 char] 79 USA 2556 41.5 [1x36 char] 80 Germany 2144 38.1 [1x36 char] 80 Japan 1968 32.1 [1x36 char] 80 USA 2120 37.2 [1x36 char] 80 Japan 2019 28 [1x36 char] 80 USA 2678 26.4 [1x36 char] 80 USA 2870 24.3 [1x36 char] 80 USA 3003 19.1 [1x36 char] 80 USA 3381 34.3 [1x36 char] 80 Germany 2188 29.8 [1x36 char] 80 Japan 2711 31.3 [1x36 char] 80 Japan 2542 37 [1x36 char] 80 Japan 2434 32.2 [1x36 char] 80 Japan 2265 46.6 [1x36 char] 80 Japan 2110 27.9 [1x36 char] 80 USA 2800 40.8 [1x36 char] 80 Japan 2110 44.3 [1x36 char] 80 Germany 2085 43.4 [1x36 char] 80 Germany 2335 36.4 [1x36 char] 80 Germany 2950 30 [1x36 char] 80 Germany 3250 44.6 [1x36 char] 80 Japan 1850 40.9 [1x36 char] 80 France 1835 33.8 [1x36 char] 80 Japan 2145 29.8 [1x36 char] 80 Germany 1845 32.7 [1x36 char] 80 Japan 2910 23.7 [1x36 char] 80 Japan 2420 35 [1x36 char] 80 England 2500 23.6 [1x36 char] 80 USA 2905 32.4 [1x36 char] 80 Japan 2290 27.2 [1x36 char] 81 USA 2490 26.6 [1x36 char] 81 USA 2635 25.8 [1x36 char] 81 USA 2620 23.5 [1x36 char] 81 USA 2725 30 [1x36 char] 81 USA 2385 39.1 [1x36 char] 81 Japan 1755 39 [1x36 char] 81 USA 1875 35.1 [1x36 char] 81 Japan 1760 32.3 [1x36 char] 81 Japan 2065 37 [1x36 char] 81 Japan 1975 37.7 [1x36 char] 81 Japan 2050 34.1 [1x36 char] 81 Japan 1985 34.7 [1x36 char] 81 USA 2215 34.4 [1x36 char] 81 USA 2045 29.9 [1x36 char] 81 USA 2380 33 [1x36 char] 81 Germany 2190 34.5 [1x36 char] 81 France 2320 33.7 [1x36 char] 81 Japan 2210 32.4 [1x36 char] 81 Japan 2350 32.9 [1x36 char] 81 Japan 2615 31.6 [1x36 char] 81 Japan 2635 28.1 [1x36 char] 81 France 3230 NaN [1x36 char] 81 Sweden 2800 30.7 [1x36 char] 81 Sweden 3160 25.4 [1x36 char] 81 Japan 2900 24.2 [1x36 char] 81 Japan 2930 22.4 [1x36 char] 81 USA 3415 26.6 [1x36 char] 81 USA 3725 20.2 [1x36 char] 81 USA 3060 17.6 [1x36 char] 81 USA 3465 28 [1x36 char] 82 USA 2605 27 [1x36 char] 82 USA 2640 34 [1x36 char] 82 USA 2395 31 [1x36 char] 82 USA 2575 29 [1x36 char] 82 USA 2525 27 [1x36 char] 82 USA 2735 24 [1x36 char] 82 USA 2865 23 [1x36 char] 82 USA 3035 36 [1x36 char] 82 Germany 1980 37 [1x36 char] 82 Japan 2025 31 [1x36 char] 82 Japan 1970 38 [1x36 char] 82 USA 2125 36 [1x36 char] 82 USA 2125 36 [1x36 char] 82 Japan 2160 36 [1x36 char] 82 Japan 2205 34 [1x36 char] 82 Japan 2245 38 [1x36 char] 82 Japan 1965 32 [1x36 char] 82 Japan 1965 38 [1x36 char] 82 Japan 1995 25 [1x36 char] 82 USA 2945 38 [1x36 char] 82 USA 3015 26 [1x36 char] 82 USA 2585 22 [1x36 char] 82 USA 2835 32 [1x36 char] 82 Japan 2665 36 [1x36 char] 82 USA 2370 27 [1x36 char] 82 USA 2950 27 [1x36 char] 82 USA 2790 44 [1x36 char] 82 Germany 2130 32 [1x36 char] 82 USA 2295 28 [1x36 char] 82 USA 2625 31 [1x36 char] 82 USA 2720 ```

The summary function provides basic statistical information for each of the variables included in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaNs.

```clc summary(cars); ```
```Acceleration: [406x1 double] min 1st Q median 3rd Q max 8 13.7000 15.5000 17.2000 24.8000 Cylinders: [406x1 double] min 1st Q median 3rd Q max 3 4 4 8 8 Displacement: [406x1 double] min 1st Q median 3rd Q max 68 105 151 302 455 Horsepower: [406x1 double] Columns 1 through 5 min 1st Q median 3rd Q max 46 75.5000 95 130 230 Column 6 NaNs 6 MPG: [406x1 double] Columns 1 through 5 min 1st Q median 3rd Q max 9 17.5000 23 29 46.6000 Column 6 NaNs 8 Model: [406x36 char] Model_Year: [406x1 double] min 1st Q median 3rd Q max 70 73 76 79 82 Origin: [406x7 char] Weight: [406x1 double] min 1st Q median 3rd Q max 1613 2226 2.8225e+003 3620 5140 ```

We can index into a dataset object like a regular matrix.

```clc cars(1:10, :) ```
```ans = Acceleration Cylinders Displacement Horsepower 12 8 307 130 11.5 8 350 165 11 8 318 150 12 8 304 150 10.5 8 302 140 10 8 429 198 9 8 454 220 8.5 8 440 215 10 8 455 225 8.5 8 390 190 MPG Model Model_Year Origin Weight 18 [1x36 char] 70 USA 3504 15 [1x36 char] 70 USA 3693 18 [1x36 char] 70 USA 3436 16 [1x36 char] 70 USA 3433 17 [1x36 char] 70 USA 3449 15 [1x36 char] 70 USA 4341 14 [1x36 char] 70 USA 4354 14 [1x36 char] 70 USA 4312 14 [1x36 char] 70 USA 4425 15 [1x36 char] 70 USA 3850 ```

We can access individual columns by referencing them by their names...

```clc cars(1:10, {'Origin', 'MPG', 'Weight'}) ```
```ans = Origin MPG Weight USA 18 3504 USA 15 3693 USA 18 3436 USA 16 3433 USA 17 3449 USA 15 4341 USA 14 4354 USA 14 4312 USA 14 4425 USA 15 3850 ```

The dot-notation allows you to extract the whole content of a variable.

```clc cars.Horsepower(1:10) ```
```ans = 130 165 150 150 140 198 220 215 225 190 ```

Dataset objects store meta information.

```get(cars) ```
``` Description: '' VarDescription: {} Units: {} DimNames: {'Observations' 'Variables'} UserData: [] ObsNames: {} VarNames: {1x9 cell} ```

We can add dataset descriptions as well as units for the variables.

```clc cars = set(cars, 'Description', 'Performance and structural information of automobiles'); cars = set(cars, 'Units', {'m/s^2', '', 'mm', 'hp', 'mpg', '', '', '', 'kg'}); summary(cars) ```
```Performance and structural information of automobiles Acceleration: [406x1 double, Units = m/s^2] min 1st Q median 3rd Q max 8 13.7000 15.5000 17.2000 24.8000 Cylinders: [406x1 double] min 1st Q median 3rd Q max 3 4 4 8 8 Displacement: [406x1 double, Units = mm] min 1st Q median 3rd Q max 68 105 151 302 455 Horsepower: [406x1 double, Units = hp] Columns 1 through 5 min 1st Q median 3rd Q max 46 75.5000 95 130 230 Column 6 NaNs 6 MPG: [406x1 double, Units = mpg] Columns 1 through 5 min 1st Q median 3rd Q max 9 17.5000 23 29 46.6000 Column 6 NaNs 8 Model: [406x36 char] Model_Year: [406x1 double] min 1st Q median 3rd Q max 70 73 76 79 82 Origin: [406x7 char] Weight: [406x1 double, Units = kg] min 1st Q median 3rd Q max 1613 2226 2.8225e+003 3620 5140 ```

## Categorical Arrays

Notice that some of the variables take on discrete values. For instance, the Cylinders, and Origin take on a unique set of values:

```clc disp('Cylinders:'); unique(cars(:, 'Cylinders')) disp('Origin:'); unique(cars(:, 'Origin')) ```
```Cylinders: ans = Cylinders 3 4 5 6 8 Origin: ans = Origin England France Germany Italy Japan Sweden USA ```

Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.

```clc Cylinders_cat = ordinal(Cylinders); Origin_cat = nominal(Origin); whos Cylinders* Origin* ```
``` Name Size Bytes Class Attributes Cylinders 406x1 3248 double Cylinders_cat 406x1 1178 ordinal Origin 406x7 5684 char Origin_cat 406x1 1366 nominal ```

Now, let's convert the variables of the dataset object.

```cars.Cylinders = ordinal(cars.Cylinders); cars.Origin = nominal(cars.Origin); ```

## Filtering

Dataset objects can be easily filtered by criteria.

For example, we can create a logical array that has ONEs where the origin is Germany and ZEROs where it's not Germany.

```germanyMask = cars.Origin == 'Germany' ```
```germanyMask = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ```

Use mask to extract out all German cars.

```clc cars(germanyMask, :) ```
```ans = Acceleration Cylinders Displacement Horsepower 20.5 4 97 46 14.5 4 107 90 12.5 4 121 113 20 4 97 48 14 4 116 90 19 4 97 60 23.5 4 97 54 18 4 121 76 21 4 97 46 15.5 4 116 75 14 4 114 91 16.5 4 98 83 15.5 4 79 67 14.5 4 97 78 16.5 4 90 71 14 4 90 70 15 4 115 95 16.9 4 116 81 14.2 4 90 70 12.2 4 97 71 16.7 6 168 120 14.5 4 97 78 14.1 4 97 78 12.8 4 121 110 21.5 4 90 48 15.9 5 131 103 14.9 4 89 71 14 4 89 71 20.1 5 183 77 14.7 4 98 76 15.8 4 97 78 21.7 4 90 48 23.7 4 90 48 19.9 5 121 67 21.8 4 146 67 15.3 4 89 62 14.2 4 105 74 15.3 4 105 74 24.6 4 97 52 MPG Model Model_Year Origin Weight 26 [1x36 char] 70 Germany 1835 24 [1x36 char] 70 Germany 2430 26 [1x36 char] 70 Germany 2234 NaN [1x36 char] 71 Germany 1978 28 [1x36 char] 71 Germany 2123 27 [1x36 char] 71 Germany 1834 23 [1x36 char] 72 Germany 2254 22 [1x36 char] 72 Germany 2511 26 [1x36 char] 73 Germany 1950 24 [1x36 char] 73 Germany 2158 20 [1x36 char] 73 Germany 2582 29 [1x36 char] 74 Germany 2219 26 [1x36 char] 74 Germany 1963 26 [1x36 char] 74 Germany 2300 25 [1x36 char] 75 Germany 2223 29 [1x36 char] 75 Germany 1937 23 [1x36 char] 75 Germany 2694 25 [1x36 char] 76 Germany 2220 29 [1x36 char] 76 Germany 1937 29.5 [1x36 char] 76 Germany 1825 16.5 [1x36 char] 76 Germany 3820 29 [1x36 char] 77 Germany 1940 30.5 [1x36 char] 77 Germany 2190 21.5 [1x36 char] 77 Germany 2600 43.1 [1x36 char] 78 Germany 1985 20.3 [1x36 char] 78 Germany 2830 31.5 [1x36 char] 78 Germany 1990 31.9 [1x36 char] 79 Germany 1925 25.4 [1x36 char] 79 Germany 3530 41.5 [1x36 char] 80 Germany 2144 34.3 [1x36 char] 80 Germany 2188 44.3 [1x36 char] 80 Germany 2085 43.4 [1x36 char] 80 Germany 2335 36.4 [1x36 char] 80 Germany 2950 30 [1x36 char] 80 Germany 3250 29.8 [1x36 char] 80 Germany 1845 33 [1x36 char] 81 Germany 2190 36 [1x36 char] 82 Germany 1980 44 [1x36 char] 82 Germany 2130 ```

Scatter plot grouped by the year of the make.

```gscatter(cars.MPG, cars.Weight, cars.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight') ```

We notice a general trend, but the amount of data prevents us from getting useful information.

We can use filtering to refine the visualization. Let's extract out only the cars made in 1970, 1976, or 1982.

```index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82; filtered = cars(index,:); ```

We have a more meaningful scatter plot for this smaller subset.

```gscatter(filtered.MPG, filtered.Weight, filtered.Model_Year, '', 'xos'); xlabel('Miles per Gallon') ylabel('Weight') ```

Add interactive case names to the plot

```gname(filtered.Model); ```

## Concatenate and Join

We can combine datasets by either concatenating or joining.

Concatenate

We have a different set of data that corresponds to small cars. Let's combine this with the original dataset. First, we'll create a dataset object from this data.

```% load carsmall cs = load('carsmall.mat'); % create dataset and convert variables to categorical arrays cars_s = dataset(cs); cars_s.Origin = nominal(cars_s.Origin); cars_s.Cylinders = ordinal(cars_s.Cylinders); % add additional levels match the levels from the carbig dataset cars_s.Cylinders = addlevels(cars_s.Cylinders, getlabels(cars.Cylinders)); cars_s.Cylinders = reorderlevels(cars_s.Cylinders, getlabels(cars.Cylinders)); ```
```Warning: Ignoring duplicate levels in NEWLEVELS. ```

Concatenate using the matrix concatenation notation.

```cars_all = [cars; cars_s]; % alternatively, % cars_all = vertcat(cars, cars_s); ```

Join

Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.

```clc tabulate(cars_all.Origin); ```
``` Value Count Percent England 1 0.20% France 18 3.56% Germany 48 9.49% Italy 9 1.78% Japan 94 18.58% Sweden 13 2.57% USA 323 63.83% ```

Create a new dataset that maps countries to continents.

```clc Newdata = dataset(... {nominal({'England';'France';'Germany';'Italy' ;'Japan';'Sweden';'USA' }),'Origin' }, ... {nominal({'Europe' ;'Europe';'Europe' ;'Europe';'Asia' ;'Europe';'North America'}),'Continent'}) ```
```Newdata = Origin Continent England Europe France Europe Germany Europe Italy Europe Japan Asia Sweden Europe USA North America ```

Join the two datasets to include Continent as a new variable.

```cars_all = join(cars_all, Newdata); clc cars_all(1:10:100, :) ```
```ans = Acceleration Cylinders Displacement Horsepower 12 8 307 130 17.5 4 133 115 15 4 113 95 15 6 199 90 13 6 232 100 12 8 400 170 19 4 71 65 12 8 400 175 14 8 307 130 15 4 98 80 MPG Model Model_Year Origin Weight 18 [1x36 char] 70 USA 3504 NaN [1x36 char] 70 France 3090 24 [1x36 char] 70 Japan 2372 21 [1x36 char] 70 USA 2648 19 [1x36 char] 71 USA 2634 13 [1x36 char] 71 USA 4746 31 [1x36 char] 71 Japan 1773 14 [1x36 char] 72 USA 4385 13 [1x36 char] 72 USA 4098 28 [1x36 char] 72 USA 2164 Continent North America Europe Asia North America North America North America Asia North America North America North America ```

## Dealing with Missing Data

Notice that we have some missing data in our MPG data

```clc cars(5:20, 'MPG') ```
```ans = MPG 17 15 14 14 14 15 NaN NaN NaN NaN NaN 15 14 NaN 15 14 ```

One way to deal with missing data is to substitute for the missing value. In this case, we will create a regression model to represent the performance measures (MPG) as functions of possible predictor variables (acceleration, cylinders, horsepower, model year, and weight)

```X = [ones(length(cars.MPG),1) cars.Acceleration, double(cars.Cylinders), ... cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight]; Y = [cars.MPG]; [b,bint,r,rint,stats] = regress(Y, X); ```

Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion on the independent variables

```cars.regress = X * b; fprintf('R-squared: %f\n', stats(1)); ```
```R-squared: 0.814178 ```

Examine the residual.

```residuals = cars.MPG - cars.regress; stem(cars.regress, residuals); xlabel('model'); ylabel('actual - model'); ```

For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.

```gname(cars.Model) ```

## Clean Up Our Model

We can potentially improve the model by adding dummy variables to handle diesels, automatic transmissions, and station wagons. In addition, we can filter out 3 and 5 cylinder engines, which are rotary engines.

Dummy variable is a binary variable that has a "1" where it satisfies the criteria and "0" everywhere else.

```% Load in as a dataset object ds = dataset('file','dummy.txt'); % Concatenate carsall = [cars, ds]; carsall = set(carsall, 'Units', [get(cars, 'Units'), {'', '', '', ''}]); % Filter out 3- and 5- cylinder engines index = carsall.Cylinders == '4' | carsall.Cylinders == '6' | carsall.Cylinders == '8'; carsall = carsall(index,:); ```

## Create a New Regression Model

Create a new regression model just by looking at 4, 6, and 8 cylinders and taking into account the car type (station wagon, diesel, automatic).

```X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model) ```

## Multivariate Analysis of Variance

Multivariate analysis of variance to see how similar the cars from various countries are, in terms of MPG, Acceleration, Weight, and Displacement.

```X = [carsall.MPG, carsall.Acceleration, carsall.Weight, carsall.Displacement]; [d, p, stats] = manova1(X, carsall.Origin); manovacluster(stats) ```

We see that Japanese and German cars are quite similar, and they are very different from English and American cars

Let's add another dummy variable that distinguished Japanese and German cars. Then redo the regression.

```carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan'); X = [ones(length(carsall.MPG),1), carsall.Acceleration, ... double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ... carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ... carsall.Automatic carsall.dummy]; Y = [carsall.MPG]; [b, bint, r, rint, stats] = regress(Y, X); carsall.regress = X * b; % Inspect once again residuals2 = carsall.MPG - carsall.regress; stem(carsall.regress, residuals2) xlabel('model'); ylabel('actual - model'); gname(carsall.Model) ```

## Robust Regression

We can also perform robust regression to deal with the outliers that may exist in the dataset.

```X2 = [carsall.Acceleration, double(carsall.Cylinders), ... carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ... carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy]; [robustbeta, stats] = robustfit(X2, Y) X3 = [ones(length(carsall.MPG),1), X2]; carsall.regress2 = X3 * robustbeta; ```
```robustbeta = -4.1872 -0.2086 -1.6006 0.0126 -0.0166 0.6444 -0.0048 0.1167 12.4719 -2.6949 1.7508 stats = ols_s: 2.9910 robust_s: 2.8012 mad_s: 2.6344 s: 2.8477 resid: [399x1 double] rstud: [399x1 double] se: [11x1 double] covb: [11x11 double] coeffcorr: [11x11 double] t: [11x1 double] p: [11x1 double] w: [399x1 double] R: [11x11 double] dfe: 374 h: [399x1 double] ```

## Perform Regression Substitution

We have been looking at linear regressions so far, but we might be able to apply some nonlinear regressions to get a better predictive model.

Once we have a regression model, we can go ahead substitute the missing values with the model data.

```carsall.mask = isnan(carsall.MPG); carsall.MPG(carsall.mask) = carsall.regress2(carsall.mask); carsall(5:20, 'MPG') ```
```ans = MPG 17 15 14 14 14 15 18.881 12.256 13.095 12.597 13.732 15 14 16.498 15 14 ```