Begin a data analysis by loading data into suitable MATLAB container variables and sorting out the "good" data from the "bad." This preliminary step helps ensure that later parts of the analysis lead to meaningful conclusions.
Note: This section begins a data analysis that is continued in Summarizing Data, Visualizing Data, and Modeling Data.
Begin by loading the data in count.dat:
load count.dat
The 24-by-3 array count contains hourly traffic counts (the rows) at three intersections (the columns) for a single day.
See Importing and Exporting Data in the MATLAB Data Analysis documentation for more information on storing data in MATLAB variables for analysis.
The MATLAB NaN (Not a Number) value is normally used to represent missing data. NaN values allow variables with missing data to maintain their structure—in this case, 24-by-1 vectors with consistent indexing across all three intersections.
Check the data at the third intersection for NaN values using the isnan function:
c3 = count(:,3); % Data at intersection 3
c3NaNCount = sum(isnan(c3))
c3NaNCount =
     0
isnan returns a logical vector the same size as c3, with entries indicating the presence (1) or absence (0) of NaN values for each of the 24 elements in the data. In this case, the logical values sum to 0, so there are no NaN values in the data.
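Because count.dat contains no missing values, the behavior of isnan is easier to see on a vector that does. The following is a minimal sketch using a hypothetical five-element vector v (not part of count.dat); the variable names are illustrative only:

```matlab
% Hypothetical data with two missing readings (not taken from count.dat)
v = [11; NaN; 7; NaN; 9];

nanMask  = isnan(v);      % Logical vector, true where v is NaN
nanTotal = sum(nanMask);  % Number of missing values
nanRows  = find(nanMask); % Row indices of the missing values
```

Summing the logical mask counts the missing values, and find converts the mask into row indices that can be used to locate them in the original data.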
NaN values are introduced into the data in the section on Outliers.
See Missing Data in the MATLAB Data Analysis documentation for more information on handling missing data.
Outliers are data values that are dramatically different from patterns in the rest of the data. They may be due to measurement error, or they may represent significant features in the data. Identifying outliers, and deciding what to do with them, depends on an understanding of the data and its source.
One common method for identifying outliers is to look for values more than a certain number of standard deviations σ from the mean μ. The following code plots a histogram of the data at the third intersection together with lines at μ and μ + nσ, for n = 1, 2:
bin_counts = hist(c3); % Histogram bin counts
N = max(bin_counts); % Maximum bin count
mu3 = mean(c3); % Data mean
sigma3 = std(c3); % Data standard deviation
hist(c3) % Plot histogram
hold on
plot([mu3 mu3],[0 N],'r','LineWidth',2) % Mean
X = repmat(mu3+(1:2)*sigma3,2,1);
Y = repmat([0;N],1,2);
plot(X,Y,'g','LineWidth',2) % Standard deviations
legend('Data','Mean','Stds')
hold off

The plot shows that some of the data are more than two standard deviations above the mean. If you identify these data as errors (not features), replace them with NaN values as follows:
outliers = (c3 - mu3) > 2*sigma3;
c3m = c3; % Copy c3 to c3m
c3m(outliers) = NaN; % Add NaN values
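The test above flags only values above the mean. If unusually low values should also be treated as outliers, one possible variation is a two-sided test on the absolute deviation. The sketch below assumes a hypothetical vector x and a threshold of n = 2 standard deviations; it is an illustration, not the method used in this analysis:

```matlab
% Hypothetical data with one unusually large value (not from count.dat)
x = [10; 12; 11; 13; 12; 11; 10; 60];
n = 2;                                    % Threshold in standard deviations (assumed)

mu    = mean(x);                          % Data mean
sigma = std(x);                           % Data standard deviation
outlierMask = abs(x - mu) > n*sigma;      % Flag values far from the mean on either side

xClean = x;                               % Copy x to xClean
xClean(outlierMask) = NaN;                % Replace flagged values with NaN
```

Note that the outlier itself inflates both the mean and the standard deviation, so a fixed threshold can miss values in small data sets.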
See Inconsistent Data in the MATLAB Data Analysis documentation for more information on handling outliers.
Plot the data at the third intersection as a time series, with the outlier identified in Outliers replaced by NaN:
plot(c3m,'o-')
hold on

The NaN value at hour 20 appears as a gap in the plot. This handling of NaN values is typical of MATLAB plotting functions.
Noisy data shows random variations about expected values. You may want to smooth the data to reveal its main features before building a model. Two basic assumptions underlie smoothing:
The relationship between the predictor (time) and the response (traffic volume) is smooth.
The smoothing algorithm results in values that are better estimates of expected values because the noise has been reduced.
Apply a simple moving average smoother to the data using the MATLAB convn function:
span = 3; % Size of the averaging window
window = ones(span,1)/span;
smoothed_c3m = convn(c3m,window,'same');
h = plot(smoothed_c3m,'ro-');
legend('Data','Smoothed Data')

The extent of the smoothing is controlled with the variable span. The averaging calculation returns NaN values whenever the smoothing window includes the NaN value in the data, thus increasing the size of the gap in the smoothed data.
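The widening of the gap can be checked on a small example. The sketch below uses a hypothetical seven-element vector y with a single NaN and the same span of 3; the names y and s are illustrative:

```matlab
% Hypothetical data with a single missing value (not from count.dat)
y = [3; 6; 9; NaN; 3; 6; 9];

span   = 3;                       % Size of the averaging window
window = ones(span,1)/span;
s = convn(y,window,'same');       % Moving average, same length as y

gapRows = find(isnan(s));         % Rows where the smoothed result is NaN
```

Every window that overlaps the NaN at row 4 produces a NaN, so the single missing value becomes a gap of span values (rows 3 through 5). The end values are also affected, because the 'same' option treats data beyond the ends as zero.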
The filter function is also used for smoothing data:
smoothed2_c3m = filter(window,1,c3m);
delete(h)
plot(smoothed2_c3m,'ro-');

The smoothed data are shifted from the previous plot. convn with the 'same' parameter returns the central part of the convolution, the same length as the data. filter returns the initial part of the convolution, the same length as the data. Otherwise, the algorithms are identical.
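The shift between the two results can be verified numerically. The following sketch assumes a hypothetical vector y without missing values and an odd span, for which the lag is (span-1)/2 samples; the variable names are illustrative:

```matlab
% Hypothetical data without missing values (not from count.dat)
y = [2; 4; 6; 8; 10; 12];

span   = 3;                        % Size of the averaging window (odd)
window = ones(span,1)/span;
lag    = (span-1)/2;               % Delay of filter relative to convn 'same'

c = convn(y,window,'same');        % Central part of the convolution
f = filter(window,1,y);            % Initial part of the convolution

% Apart from the ends, f is c delayed by lag samples
maxDiff = max(abs(f(1+lag:end) - c(1:end-lag)));
```

Here maxDiff is zero to within rounding error, which confirms that the two functions compute the same averages and differ only in how the output is aligned with the input.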
Smoothing estimates the center of the distribution of response values at each value of the predictor. It invalidates a basic assumption of many fitting algorithms, namely, that the errors at each value of the predictor are independent. Accordingly, you can use smoothed data to identify a model, but avoid using smoothed data to fit a model.
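To see why smoothing breaks the independence assumption, consider, as an illustrative assumption not drawn from the traffic data, a three-point centered moving average s_t applied to independent errors e_t with common variance σ² (here σ denotes the error standard deviation, not the data standard deviation used in Outliers). Adjacent smoothed values share two of their three inputs, so they are correlated:

```latex
\begin{align*}
s_t &= \tfrac{1}{3}\,(e_{t-1} + e_t + e_{t+1}), \\
\operatorname{Var}(s_t) &= \tfrac{1}{9}\cdot 3\sigma^2 = \tfrac{\sigma^2}{3}, \\
\operatorname{Cov}(s_t, s_{t+1}) &= \tfrac{1}{9}\bigl(\operatorname{Var}(e_t) + \operatorname{Var}(e_{t+1})\bigr) = \tfrac{2\sigma^2}{9}, \\
\operatorname{Corr}(s_t, s_{t+1}) &= \frac{2\sigma^2/9}{\sigma^2/3} = \tfrac{2}{3}.
\end{align*}
```

Under these assumptions, neighboring smoothed values have a correlation of 2/3 even though the original errors are uncorrelated.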
See Filtering Data in the MATLAB Data Analysis documentation for more information on smoothing and filtering.
© 1984-2009 The MathWorks, Inc.