Preprocessing Data

Importing Data

Introduction

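You import data sets into Curve Fitting Tool with the Data Sets pane of the Data GUI. Using this pane, you can select the workspace variables that make up a data set, name the data set, and view, rename, or delete imported data sets.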

The Data Sets pane is shown below followed by a description of its features.

Creating a Data Set

Working with Data Sets

Example: Importing Data

This example imports the ENSO data set into the Curve Fitting Tool using the Data Sets pane of the Data GUI. The first step is to load the data from the file enso.mat into the MATLAB® workspace.

load enso

The workspace now contains two new variables: month, the predictor data, and pressure, the response data.

Alternatively, you can import data by specifying the variable names as arguments to the cftool function.

cftool(month,pressure)

In this case, the Data GUI is not opened.

The data import process is described below:

  1. Select workspace variables.

    The predictor and response data are displayed graphically in the Preview window. Weights and data points containing Infs or NaNs are not displayed.

  2. Specify the data set name.

    You should specify a meaningful name when you import multiple data sets. If you do not specify a name, the default name, which is constructed from the selected variable names, is used.

  3. Click the Create data set button.

    The Data sets list box displays all the data sets added to the toolbox. Note that you can construct data sets from workspace variables, or by smoothing an existing data set.

    If your data contains Infs or complex values, a warning message like this appears.

The Data Sets pane shown below displays the imported ENSO data in the Preview window. After you click the Create data set button, the data set enso is added to the Data sets list box. You can then view, rename, or delete enso by selecting it in the list box and clicking the appropriate button.
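
Before importing, it can help to confirm at the command line that the two variables form a usable predictor and response pair. The following is a minimal sketch that uses only core MATLAB functions. The variables x and y are hypothetical stand-ins for month and pressure, and the checks reflect an assumption about what is worth verifying, not a description of what the Data GUI does internally.

```matlab
% Hypothetical stand-ins for the predictor and response variables
x = (1:6)';
y = [10.2 11.5 NaN 9.8 Inf 12.1]';

% Both variables should be vectors with the same number of elements
sameLength = isvector(x) && isvector(y) && numel(x) == numel(y);

% Complex values produce a warning on import, so check for real data
allReal = isreal(x) && isreal(y);

% Count the points that contain Infs or NaNs in either variable
numNonFinite = sum(~isfinite(x) | ~isfinite(y));

fprintf('same length: %d, real: %d, non-finite points: %d\n', ...
        sameLength, allReal, numNonFinite);
```

Points counted in numNonFinite are the ones that would not appear in the Preview window.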

Viewing Data

Viewing Data Graphically

After you import a data set, it is automatically displayed as a scatter plot in Curve Fitting Tool. The response data is plotted on the vertical axis and the predictor data is plotted on the horizontal axis.

The scatter plot is a powerful tool because it allows you to view the entire data set at once, and it can easily display a wide range of relationships between the two variables. You should examine the data carefully to determine whether preprocessing is required, or to deduce a reasonable fitting approach. For example, it's typically very easy to identify outliers in a scatter plot, and to determine whether you should fit the data with a straight line, a periodic function, a sum of Gaussians, and so on.

Enhancing the Graphical Display.   Curve Fitting Toolbox™ software provides several tools for enhancing the graphical display of a data set. These tools are available through the Tools menu, the GUI toolbar, and right-click menus.

You can zoom in or out, turn on or off the grid, and so on using the Tools menu and the GUI toolbar shown below.

You can change the color, line width, line style, and marker type of the displayed data points using the right-click menu shown below. You activate this menu by placing your mouse over a data point and right-clicking. Note that a similar menu is available for fitted curves.

The ENSO data is shown below after the display has been enhanced using several of these tools.

Viewing Data Numerically

You can view the numerical values of a data set, as well as data points to be excluded from subsequent fits, with the View Data Set GUI. You open this GUI by selecting a name in the Data sets list box of the Data GUI and clicking the View button.

The View Data Set GUI for the ENSO data set is shown below, followed by a description of its features.

Smoothing Data

Introduction

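If your data is noisy, you might need to apply a smoothing algorithm to expose its features, and to provide a reasonable starting approach for parametric fitting. The two basic assumptions that underlie smoothing are that the relationship between the response data and the predictor data is smooth, and that the smoothing process yields a value that is a better estimate of the original value because the noise has been reduced.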

You can think of smoothing as a local fit because a new response value is created for each original response value. Therefore, smoothing is similar to some of the nonparametric fit types supported by the toolbox, such as smoothing spline and cubic interpolation. However, this type of fitting is not the same as parametric fitting, which results in a global parameterization of the data.

There are two common types of smoothing methods: filtering (averaging) and local regression. Each smoothing method requires a span. The span defines a window of neighboring points to include in the smoothing calculation for each data point. This window moves across the data set as the smoothed response value is calculated for each predictor value. A large span increases the smoothness but decreases the resolution of the smoothed data set, while a small span decreases the smoothness but increases the resolution of the smoothed data set. The optimal span value depends on your data set and the smoothing method, and usually requires some experimentation to find.
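
The effect of the span can be illustrated with a hand-written moving average. This is a minimal sketch under stated assumptions, not the toolbox implementation: the data is synthetic, the window is shrunk symmetrically near the ends of the data, and roughness (the variance of first differences) is a measure introduced here only for illustration.

```matlab
% Synthetic noisy periodic data (hypothetical stand-in for a real data set)
rng(0);                                   % fixed seed so the run is repeatable
x = (1:168)';
y = sin(2*pi*x/12) + 0.5*randn(size(x));

spans = [5 21];                           % odd spans: a small and a large window
n  = numel(y);
ys = zeros(n, numel(spans));
for k = 1:numel(spans)
    halfMax = (spans(k) - 1)/2;           % points on each side of the center
    for i = 1:n
        % Shrink the window symmetrically near the ends of the data
        half = min([halfMax, i - 1, n - i]);
        ys(i,k) = mean(y(i-half:i+half));
    end
end

% Roughness of the raw data and of each smoothed version
roughness = [var(diff(y)), var(diff(ys(:,1))), var(diff(ys(:,2)))];
disp(roughness)
```

The larger span gives the smoothest result, but because its window covers more than one period of the underlying oscillation, it also flattens the periodic structure. This is the loss of resolution described above.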

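Curve Fitting Toolbox software supports several smoothing methods, including moving average filtering, lowess and loess local regression, and Savitzky-Golay filtering.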

Note that you can also smooth data using a smoothing spline. Refer to Nonparametric Fitting for more information.

You smooth data with the Smooth pane of the Data GUI. The pane is shown below followed by a description of its features.

Creating a Smoothed Data Set

Smoothing Method

Working with Smoothed Data Sets

Example: Smoothing Data

This example smooths the ENSO data set using the moving average, lowess, loess, and Savitzky-Golay methods with the default span. As shown below, the data appears noisy. Smoothing might help you visualize patterns in the data, and provide insight toward a reasonable approach for parametric fitting.

The Smooth pane shown below displays all the new data sets generated by smoothing the original ENSO data set. Whenever you smooth a data set, a new data set of smoothed values is created. The smoothed data sets are automatically displayed in Curve Fitting Tool. You can also display a single data set graphically and numerically by clicking the View button.

Use the Plotting GUI to display only the data sets of interest. As shown below, the structure of the ENSO data set becomes apparent when it is smoothed using a moving average filter with the default span. The uncovered structure is periodic, which suggests that a reasonable parametric model should include trigonometric functions.

Saving the Results.   By clicking the Save to workspace button, you can save a smoothed data set as a structure to the MATLAB workspace. This example saves the moving average results contained in the enso (ma) data set.

The saved structure contains the original predictor data x and the smoothed data y.

smootheddata1

smootheddata1 = 
    x: [168x1 double]
    y: [168x1 double]

Excluding and Sectioning Data

Introduction

If there is justification, you might want to exclude part of a data set from a fit. Typically, you exclude data so that subsequent fits are not adversely affected. For example, if you are fitting a parametric model to measured data that has been corrupted by a faulty sensor, the resulting fit coefficients will be inaccurate.

Curve Fitting Toolbox software provides two methods to exclude data: marking outliers and sectioning.

For each of these methods, you must create an exclusion rule, which captures the range, domain, or index of the data points to be excluded.
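
One way to picture an exclusion rule is as a logical vector with one element per data point. The following is a minimal sketch that uses only core MATLAB indexing. The data values, thresholds, and variable names are hypothetical, and the sketch does not show how the toolbox stores rules internally.

```matlab
% Hypothetical data with one suspicious response value
x = (1:10)';
y = [2 4 6 50 10 12 14 16 18 20]';

% Exclusion by index: mark the fourth data point
byIndex = false(size(x));
byIndex(4) = true;

% Exclusion by domain: mark predictor values outside [2, 9]
byDomain = x < 2 | x > 9;

% Exclusion by range: mark response values above 30
byRange = y > 30;

% A point is excluded if any of the three conditions marks it
excluded = byIndex | byDomain | byRange;

% Keep only the data that the rule does not exclude
xKept = x(~excluded);
yKept = y(~excluded);
```

Because the index-based part of such a rule refers to specific positions, it applies only to data sets of the same size as the one it was built for.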

To exclude data while fitting, you use the Fitting GUI to associate the appropriate exclusion rule with the data set to be fit. Refer to Example: Robust Fitting for more information about fitting a data set using an exclusion rule.

You mark data to be excluded from a fit with the Exclude GUI, which you open from Curve Fitting Tool. The GUI is shown below followed by a description of its features.

Exclusion Rules

Excluding Individual Data Points

Excluding Data Sections in the Domain or Range

Marking Outliers

Outliers are defined as individual data points that you exclude from a fit because they are inconsistent with the statistical nature of the bulk of the data, and will adversely affect the fit results. Outliers are often readily identified by a scatter plot of response data versus predictor data.

Marking outliers with Curve Fitting Tool follows these rules:

As described in Parametric Fitting, one of the basic assumptions underlying curve fitting is that the data is statistical in nature and is described by a particular distribution, which is often assumed to be Gaussian. The statistical nature of the data implies that it contains random variations along with a deterministic component.

data = deterministic component + random component

However, your data set might contain one or more data points that are non-statistical in nature, or are described by a different statistical distribution. These data points might be easy to identify, or they might be buried in the data and difficult to identify.

A non-statistical process can involve the measurement of a physical variable such as temperature or voltage in which the random variation is negligible compared to the systematic errors. For example, if your sensor calibration is inaccurate, the data measured with that sensor will be systematically inaccurate. In some cases, you might be able to quantify this non-statistical data component and correct the data accordingly. However, if the scatter plot reveals that a handful of response values are far removed from neighboring response values, these data points are considered outliers and should be excluded from the fit. Outliers are usually difficult to explain away. For example, it might be that your sensor experienced a power surge or someone wrote down the wrong number in a log book.

If you decide there is justification, you should mark outliers to be excluded from subsequent fits, particularly parametric fits. Removing these data points can have a dramatic effect on the fit results because the fitting process minimizes the square of the residuals, so a point that lies far from the curve contributes disproportionately to the quantity being minimized. If you do not exclude outliers, the resulting fit will be poor for a large portion of your data. Conversely, if you do exclude the outliers and choose the appropriate model, the fit results should be reasonable.
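
The sensitivity of a least-squares fit to a single outlier can be checked at the command line. This is a minimal sketch that uses the core MATLAB function polyfit in place of the toolbox fitting functions. The data is hypothetical: an exact straight line with one corrupted response value.

```matlab
% Exact straight line: slope 2, intercept 1
x = (0:9)';
y = 2*x + 1;

% Corrupt the last response value (its true value is 19)
yOut = y;
yOut(10) = 60;

% Least-squares line through all points, including the outlier
pAll = polyfit(x, yOut, 1);

% Least-squares line with the outlier excluded
keep = true(size(x));
keep(10) = false;
pKept = polyfit(x(keep), yOut(keep), 1);

fprintf('slope with outlier: %.3f, slope without: %.3f\n', pAll(1), pKept(1));
```

With the outlier excluded, the fit recovers the original slope and intercept. With it included, the single point pulls the slope well away from 2, which degrades the fit for the other nine points.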

Because outliers can have a significant effect on a fit, they are considered influential data. However, not all influential data points are outliers. For example, your data set can contain valid data points that are far removed from the rest of the data. The data is valid because it is well described by the model used in the fit. The data is influential because its exclusion will dramatically affect the fit results.

Two types of influential data points are shown below for generated data. Also shown are cubic polynomial fits and a robust fit that is resistant to outliers.

Plot (a) shows that the two influential data points are outliers and adversely affect the fit. Plot (b) shows that the two influential data points are consistent with the model and do not adversely affect the fit. Plot (c) shows that a robust fitting procedure is an acceptable alternative to marking outliers for exclusion.

Sectioning

Sectioning involves specifying response or predictor data to exclude. You might want to section a data set because different parts of the data set are described by different models or are corrupted by noise, large systematic errors, and so on.

Sectioning data with Curve Fitting Tool follows these rules:

Two examples of sectioning by domain are shown below for generated data.

The upper plot shows the data set sectioned by fit type. The section to the left of 4 is fit with a linear polynomial, as shown by the bold, dashed line. The section to the right of 4 is fit with a cubic polynomial, as shown by the bold, solid line.

The lower plot shows the data set sectioned by fit type and by valid data. Here, the right-most section is not part of any fit because the data is corrupted by noise.
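
Sectioning by domain can be sketched at the command line with logical masks and polyfit. The generated data, the break points at 4 and 8, and the variable names below are assumptions chosen for illustration. They are not the data used in the plots.

```matlab
rng(1);                               % fixed seed so the run is repeatable
x = linspace(0, 10, 101)';

% Three sections of the domain
left   = x <= 4;                      % fit with a linear polynomial
middle = x > 4 & x <= 8;              % fit with a cubic polynomial
right  = x > 8;                       % treated as corrupted; not fit at all

% Generated data: line, then cubic, then pure noise
y = zeros(size(x));
y(left)   = 1 + 0.5*x(left);
y(middle) = 3 + (x(middle) - 4).^3;
y(right)  = 100*randn(nnz(right), 1);

% Fit each valid section with its own model
pLeft   = polyfit(x(left),   y(left),   1);
pMiddle = polyfit(x(middle), y(middle), 3);

% Largest residual of the cubic fit on its own section
maxResid = max(abs(polyval(pMiddle, x(middle)) - y(middle)));
```

Each fit uses only the points in its own section, so the noisy right-most section has no influence on either set of coefficients.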

Example: Excluding and Sectioning Data

This example modifies the ENSO data set to illustrate excluding and sectioning data. First, copy the ENSO response data to a new variable and add two outliers that are far removed from the bulk of the data.

yy = pressure;                                          % copy the response data
yy(ceil(length(month)*rand(1))) = mean(pressure)*2.5;   % first outlier at a random index
yy(ceil(length(month)*rand(1))) = mean(pressure)*3.0;   % second outlier at a random index

Import the variables month and yy as the new data set enso1, and open the Exclude GUI.

Assume that the first and last eight months of the data set are unreliable, and should be excluded from subsequent fits. The simplest way to exclude these data points is to section the predictor data. To do this, specify the data you want to exclude in the Exclude Sections field of the Exclude GUI.

There are two ways to exclude individual data points: using the Check to exclude point table or graphically. For this example, the simplest way to exclude the outliers is graphically. To do this, select the data set name and click the Exclude graphically button, which opens the Select Points for Exclusion Rule GUI.

To mark data points for exclusion in the GUI, place the mouse cursor over the data point and left-click. The excluded data point is marked with a red x. To include an excluded data point, right-click the data point or select the Includes Them radio button and left-click. Included data points are marked with a blue circle. To select multiple data points, click the left mouse button and drag the selection rubber band so that the rubber band box encompasses the desired data points. Note that the GUI identifies sectioned data with gray strips. You cannot graphically include sectioned data.

As shown below, the first and last eight months of data are excluded from the data set by sectioning, and the two outliers are excluded graphically. Note that the graphically excluded data points are identified in the Check to exclude point table. If you decide to include an excluded data point using the table, the graph is automatically updated.

If there are fits associated with the data, you can exclude data points based on the residuals of the fit by selecting the residual data in the Y list.

The Exclude GUI for this example is shown below.

To save the exclusion rule, click the Create exclusion rule button. To exclude the data from a fit, you must select the exclusion rule from the Fitting GUI. Because the exclusion rule created in this example uses individually excluded data points, you can use it only with data sets that are the same size as the ENSO data set.

Viewing the Exclusion Rule.   To view the exclusion rule, select an existing exclusion rule name and click the View button.

The View Exclusion Rule GUI shown below displays the modified ENSO data set and the excluded data points, which are grayed in the table.

Missing Values and Outliers

Although Curve Fitting Toolbox software ignores Infs and NaNs when fitting data, and you can exclude outliers during the fitting process, you might still want to remove this data from your data set. To do so, you modify the associated data set variables from the MATLAB command line.

For example, when using toolbox functions such as fit from the command line, you must supply predictor and response vectors that contain finite numbers. To remove Infs, you can use the isinf function.

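ind = find(isinf(xx));   % indices of the infinite predictor values
xx(ind) = [];            % remove those predictor values
yy(ind) = [];            % remove the matching response values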

To remove NaNs, you can use the isnan function. For examples that remove NaNs and outliers from a data set, refer to Missing Data in the MATLAB documentation.
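
NaNs and Infs can also be removed in one step with the core MATLAB function isfinite, which is false for both. The following is a minimal sketch with hypothetical data. It checks both vectors so that each predictor value stays paired with its response value.

```matlab
% Hypothetical paired data with a NaN and an Inf in the response
xx = [1 2 3 4 5 6]';
yy = [10 NaN 30 Inf 50 60]';

% A point is kept only if both its predictor and response are finite
ok = isfinite(xx) & isfinite(yy);

xx = xx(ok);
yy = yy(ok);
```

Logical indexing is used here in place of find, which avoids creating an intermediate index vector.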

  


© 1984-2008 The MathWorks, Inc.