MATLAB Examples

Load the Fisher iris sample data. The data contains length and width measurements from the sepals and petals of three species of iris flowers. Store the petal length data for the versicolor species.

Use anovan to fit models where a factor's levels represent a random selection from a larger (infinite) set of possible levels.

Perform N-way ANOVA on car data with mileage and other information on 406 cars made between 1970 and 1982.

Perform one-way ANOVA to determine whether data from several groups have a common mean.

Perform two-way ANOVA to determine the effect of car model and factory on the mileage rating of cars.

Load the sample data.

Use logistic regression and other techniques to perform data analysis on tall arrays. Tall arrays represent data that is too large to fit into computer memory.

Perform statistical analysis and machine learning on out-of-memory data with MATLAB® and Statistics and Machine Learning Toolbox™.

Generate a nonlinear classifier with a Gaussian kernel function. First, generate one class of points inside the unit disk in two dimensions, and another class of points in an annulus surrounding it.

Perform linear and quadratic classification of Fisher iris data.

Use a random subspace ensemble to increase the accuracy of classification, and use cross-validation to determine good parameters for both the weak learner template and the ensemble.

You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is binary, labeling each radar return as good or bad.

When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance often differ when surrogate splits are included.

Create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.

Fit a nonlinear regression tree model to hourly day-ahead electricity prices in the New England pool region. The log electricity prices are modeled with two additive components.

Small, round blue-cell tumors (SRBCTs) belong to four distinct diagnostic categories. The categories have widely differing prognoses and treatment options, making accurate diagnosis extremely important.

Obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics:

The RobustBoost algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost parameters can produce an ensemble that does not predict well.

Predict posterior probabilities of SVM models over a grid of observations, and then plot the posterior probabilities over the grid. Plotting posterior probabilities exposes decision boundaries.

Make a more robust and simpler model by trying to remove predictors without hurting the predictive power of the model. This is especially important when you have many predictors in your data.

Determine which quadrant of an image a shape occupies by training an error-correcting output codes (ECOC) model comprised of linear SVM binary learners.

Train a basic discriminant analysis classifier to classify irises in Fisher's iris data.

Train an ensemble of classification trees using data containing predictors with many categorical levels.

Perform classification using discriminant analysis, naive Bayes classifiers, and decision trees. Suppose you have a data set containing observations with measurements on different variables (predictors) and their known class labels.

Build an automated credit rating tool.

Use a custom kernel function, such as the sigmoid kernel, to train SVM classifiers, and adjust custom kernel function parameters.

Train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease.

Tune the regularization parameter in fscnca using cross-validation. Tuning the regularization parameter helps to correctly detect the relevant features in the data.

Optimize an SVM classification. The classification works on locations of points from a Gaussian mixture model, as described by Hastie, Tibshirani, and Friedman in The Elements of Statistical Learning.

Optimize an SVM classification using the fitcsvm function and the OptimizeHyperparameters name-value pair. The classification works on locations of points from a Gaussian mixture model.

Products used: Statistics and Machine Learning Toolbox™ and MATLAB® Coder™.

Find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance. This distance metric is used in correspondence analysis.

Classify query data using a nearest neighbor search.

Gaussian mixture models (GMM) are often used for data clustering. Usually, fitted GMMs cluster by assigning query data points to the multivariate normal components that maximize the component posterior probability given the data.

Predict classification for a k-nearest neighbor classifier.

Examine the quality of a k-nearest neighbor classifier using resubstitution and cross validation.

Modify a k-nearest neighbor classifier.

Construct a k-nearest neighbor classifier for the Fisher iris data.

Add a MATLAB Function block to a Simulink® model for label prediction. The MATLAB Function block accepts streaming data, and predicts the label and classification score using a trained support vector machine (SVM) classification model.

Products used: Statistics and Machine Learning Toolbox™, MATLAB® Coder™, Simulink®, and Computer Vision Toolbox™.

Generate code from a prediction function that has variable-sized input arguments. Specifically, the example creates a function that predicts labels based on a trained classification model.

Use a Stateflow® chart for label prediction. The example trains a discriminant analysis model for the Fisher iris data set by using fitcdiscr, and declares a function for code generation

Generate code for finding the nearest neighbor using an exhaustive searcher object at the command line. This example shows two different methods depending on the way you use the object.

Create scatter plots using grouped sample data.

Several common indexing and searching methods.

Several indexing and searching methods for categorical arrays.

Compute and compare measures of dispersion for sample data that contains one outlier.

Explore the distribution of data using descriptive statistics.

Plot data grouped by the levels of a categorical variable.

Categorize numeric data into a categorical ordinal array using ordinal. This is useful for discretizing continuous data.

Work with dataset array variables and their data.

Select an observation or subset of observations from a dataset array.

Determine sorting order for ordinal arrays.

Compute and compare measures of location for sample data that contains one outlier.

Compute summary statistics grouped by levels of a categorical variable. You can compute group summary statistics for a numeric array or a dataset array using grpstats.
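As a minimal sketch, assuming the Fisher iris data used in other examples (load fisheriris provides the numeric matrix meas and the grouping variable species):

```matlab
% Load Fisher iris data: meas is 150-by-4 numeric, species is 150-by-1
load fisheriris

% Group means of each measurement column, by species level
groupMeans = grpstats(meas, species);   % 3-by-4, one row per species
```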

Create nominal arrays using nominal.

Sort observations (rows) in a dataset array using the command line. You can also sort rows using the Variables editor.

Create a 3-by-3 matrix of sample data. Remove two data values by replacing them with NaN.
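A minimal sketch of that setup (using magic(3) for the sample values is just one illustrative choice):

```matlab
% Create a 3-by-3 matrix of sample data
x = magic(3);

% Remove two data values by replacing them with NaN (linear indexing)
x([1 5]) = NaN;
```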

Create a dataset array from a numeric array existing in the MATLAB® workspace.

Change the labels for category levels in categorical arrays using setlabels. You also have the option to specify labels when creating a categorical array.

Merge categories in a categorical array using mergelevels. This is useful for collapsing categories with few observations.

Reorder the category levels in nominal arrays using reorderlevels. By definition, nominal array categories have no natural ordering. However, you might want to change the order of levels for display or analysis purposes.

Create a dataset array from heterogeneous variables existing in the MATLAB® workspace.

Create ordinal arrays using ordinal.

Reorder the category levels in an ordinal array using reorderlevels.

Use cmdscale to perform classical (metric) multidimensional scaling, also known as principal coordinates analysis.

Analyze if companies within the same sector experience similar week-to-week changes in stock price.

Perform nonnegative matrix factorization.

Use Procrustes analysis to compare two handwritten number threes. Visually and analytically explore the effects of forcing size and reflection changes.

Perform feature selection that is robust to outliers using a custom robust loss function in NCA.

Use an output function in tsne.

Visualize the MNIST data, which consists of images of handwritten digits, using the tsne function. The images are 28-by-28 pixels in grayscale. Each image has an associated label from 0 through 9, which is the digit the image depicts.

Use rica to disentangle mixed audio signals. You can use rica to perform independent component analysis (ICA) when prewhitening is included as a preprocessing step.

The effects of various tsne settings.

A complete workflow for feature extraction from image data.

This demo showcases visualization and analysis (heavy on statistics) for forecasting energy usage based on historical hour-by-hour utility usage data.

This example was authored by the MathWorks community.

The problem of feature selection is: given a (usually large) number of noisy and partly redundant variables and a target that we would like to predict, choose a small but indicative subset as predictors.

Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).

The CGDS toolbox provides a set of functions for retrieving data from the cBio Cancer Genomics Data Portal web API. Get started by adding the CGDS toolbox directory to the path and setting the server URL.

Estimate the cumulative distribution function (CDF) from data in a non-parametric or semi-parametric fashion. It also illustrates the inversion method for generating random numbers from the estimated distribution.

Use some more advanced techniques with the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data.

Fit the generalized extreme value distribution using maximum likelihood estimation. The extreme value distribution is used to model the largest or smallest value from a group or block of

The difference between fitting a curve to a set of points, and fitting a probability distribution to a sample of data.

Use the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data.

Analyze lifetime data with censoring. In biological or medical applications, this is known as survival analysis, and the times may represent the survival time of an organism or the time to failure of a machine component.

Fit tail data to the Generalized Pareto distribution by maximum likelihood estimation.

Fit univariate distributions using least squares estimates of the cumulative distribution functions. This is a generally applicable method that can be useful in cases when maximum likelihood estimation fails or is impractical.

Compare multiple sample distributions using a statistic of the maximum difference in probability of cumulative distributions, an extension of the Kolmogorov-Smirnov test to more than two samples.

Determine the number of samples or observations needed to carry out a statistical test. It illustrates sample size calculations for a simple problem, then shows how to use the sampsizepwr function.

Use hypothesis testing to analyze gas prices measured across the state of Massachusetts during two separate months.

Demo file from the August 7, 2007 webinar titled Data Analysis with Statistics Toolbox and Curve Fitting Toolbox.

Statins are the most common class of drugs used for treating hyperlipidemia. However, studies have shown that even at their maximum dosage of 80 mg, many patients do not reach their LDL cholesterol goals.

This is a walkthrough of the Demo shown in the 30 November 2006 Webinar titled "Using Statistics for Uncertainty Analysis in System Models". The demo covers two basic topics:

Estimate and plot the cumulative hazard and survivor functions for different groups.

This demo uses MATLAB, the Statistics Toolbox, the Curve Fitting Toolbox, and the Optimization Toolbox to improve the design of an engine cooling fan using Design for Six Sigma Techniques.

Find the empirical survivor functions and the parametric survivor functions using the Burr type XII distribution fit to data for two groups.

Convert survival data to counting process form and then construct a Cox proportional hazards model with time-dependent covariates.

Construct a Cox proportional hazards model, and assess the significance of the predictor variables.

Machine learning techniques are often used for financial analysis and decision-making tasks such as accurate forecasting, classification of risk, and estimating probabilities of default.

Clustering is a form of unsupervised learning. The purpose of clustering is to identify natural groupings in a large data set and produce a concise representation of the data.

In the past decade, the development of automatic techniques to estimate the intrinsic dimensionality of a given dataset has gained considerable attention due to its relevance in several application domains.

Fit a nonlinear temperature model to hourly dry bulb temperatures recorded in the New England region. The temperature series is modeled as a sum of two components: a deterministic seasonal trend and a stochastic component.

The Natural Gas Price model, Temperature model, and Electricity Price hybrid model are jointly simulated to create market scenarios for a given set of plant parameters and constraints.

When the arrays are too large, the full distance matrix may not fit into memory. ipdm is smart enough to break the problem into pieces to accomplish the task anyway.

Visualize dissimilarity data using non-classical forms of multidimensional scaling (MDS).

Use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as

Visualize multivariate data using various statistical plots. Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to visualize with 2-D scatter plots.

Perform "classical" multidimensional scaling, using the cmdscale function in the Statistics and Machine Learning Toolbox™. Classical multidimensional scaling is also known as principal coordinates analysis.

Select features for classifying high-dimensional data. More specifically, it shows how to perform sequential feature selection, which is one of the most popular feature selection methods.

Perform factor analysis using Statistics and Machine Learning Toolbox™.

Examine similarities and dissimilarities of observations or objects using cluster analysis in Statistics and Machine Learning Toolbox™. Data often fall naturally into groups, or clusters, of similar observations.

Compute and plot the pdf of a Poisson distribution with parameter lambda = 5.
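A sketch of that computation (the evaluation range 0 to 15 is an illustrative choice):

```matlab
% pdf of a Poisson distribution with parameter lambda = 5
x = 0:15;
y = poisspdf(x, 5);

% Bar chart, since the Poisson distribution is discrete
bar(x, y)
xlabel('Observation'); ylabel('Probability')
```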

Use copulafit to calibrate copulas with data, then generate simulated data Xsim with a distribution "just like" (in terms of marginal distributions and correlations) the distribution of the original data.

Similar to the bootstrap is the jackknife, which uses resampling to estimate the bias of a sample statistic. Sometimes it is also used to estimate the standard error of the sample statistic.

Plot the pdf of a bivariate Student's t distribution. You can use this distribution for a higher number of dimensions as well, although visualization is not easy.

Compute and plot the pdf using four different values for the parameter r, the desired number of successes: 0.1, 1, 3, and 6. In each case, the probability of success p is 0.5.
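One way the loop over r values might look (the x-range is an illustrative choice):

```matlab
% Negative binomial pdf for several values of r, with p = 0.5 throughout
x = 0:10;
p = 0.5;
hold on
for r = [0.1 1 3 6]
    plot(x, nbinpdf(x, r, p), '.-')
end
hold off
legend('r = 0.1', 'r = 1', 'r = 3', 'r = 6')
```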

As for all discrete distributions, the cdf is a step function. The plot shows the discrete uniform cdf for N = 10.

Compute and plot the pdf of a multivariate normal distribution.

The bootstrap procedure involves choosing random samples with replacement from a data set and analyzing each sample the same way. Sampling with replacement means that each observation can be selected more than once.

Compute the pdf of an F distribution with 5 numerator degrees of freedom and 3 denominator degrees of freedom.
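For example:

```matlab
% pdf of an F distribution with 5 numerator and 3 denominator
% degrees of freedom
x = 0:0.01:10;
y = fpdf(x, 5, 3);
plot(x, y)
```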

Compute the pdf of a gamma distribution with parameters A = 100 and B = 10. For comparison, also compute the pdf of a normal distribution with parameters mu = 1000 and sigma = 100.
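A sketch of the comparison, assuming a shared x-grid that covers the bulk of both densities (both have mean 1000 and standard deviation 100):

```matlab
% Gamma pdf with shape A = 100 and scale B = 10 (mean A*B = 1000)
x = 600:1:1400;
yGam  = gampdf(x, 100, 10);

% Normal pdf with mu = 1000 and sigma = 100, for comparison
yNorm = normpdf(x, 1000, 100);

plot(x, yGam, '-', x, yNorm, '--')
legend('Gamma(100, 10)', 'Normal(1000, 100)')
```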

Compute the pdf of an exponential distribution with parameter mu = 2.

Several examples show how to use the gkdeb function.

Pick a random sample of 10 from a list of 553 items.

Suppose the income of a family of four in the United States follows a lognormal distribution with mu = log(20,000) and sigma = 1. Compute and plot the income density.
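A sketch of that computation (the x-range is an illustrative choice):

```matlab
% Lognormal income density with mu = log(20,000) and sigma = 1
x = 0:1000:120000;
y = lognpdf(x, log(20000), 1);
plot(x, y)
xlabel('Income'); ylabel('Density')
```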

Compute the pdf for a Student's t distribution with parameter nu = 5, and for a standard normal distribution.
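For example:

```matlab
% Student's t pdf with nu = 5, against the standard normal pdf
x = -5:0.1:5;
yT = tpdf(x, 5);
yN = normpdf(x, 0, 1);
plot(x, yT, '-', x, yN, '--')
legend('t, nu = 5', 'standard normal')
```

The t density has heavier tails and a lower peak than the normal.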

Compute the pdf of a chi-square distribution with 4 degrees of freedom.
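For example:

```matlab
% pdf of a chi-square distribution with 4 degrees of freedom
x = 0:0.2:15;
y = chi2pdf(x, 4);
plot(x, y)
```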

Compute and plot the cdf of a hypergeometric distribution.

Since the bivariate normal distribution is defined on the plane, you can also compute cumulative probabilities over rectangular regions.

Generate examples of probability density functions for the three basic forms of the generalized extreme value distribution.

Suppose the probability of a five-year-old car battery not starting in cold weather is 0.03. What is the probability of the car starting for 25 consecutive days during a long cold snap?
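Treating the 25 days as independent trials, the probability of a start on any one day is 1 - 0.03 = 0.97, so the answer is 0.97^25:

```matlab
% Probability the battery starts on all 25 consecutive cold days
pStart = 1 - 0.03;
pAll25 = pStart^25   % approximately 0.467
```

So even a fairly reliable battery has roughly a coin-flip chance of making it through the cold snap without a single failure.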

Use haltonset to construct a 2-D Halton quasi-random point set.

Compute the pdf of an extreme value distribution.

Compute the pdf of three generalized Pareto distributions. The first has shape parameter k = -0.25, the second has k = 0, and the third has k = 1.

In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric

Use Cook's Distance to determine the outliers in the data.

Test for the significance of the regression coefficients using the t-statistic.

Display R-squared (coefficient of determination) and adjusted R-squared. Load the sample data and define the response and independent variables.

Fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.

Compute the covariance matrix and standard errors of the coefficients.

Use the CovRatio statistics to determine the influential points in data. Load the sample data and define the response and predictor variables.

Identify and remove redundant predictors from a generalized linear model.

This example uses a bagged ensemble so that it can apply all three methods of evaluating ensemble quality.

View a classification or regression tree. There are two ways to view a tree: view(tree) returns a text description and view(tree,'mode','graph') returns a graphic description of the tree.

Use data for predicting the insurance risk of a car based on its many attributes.

Determine the observations that are influential on the fitted response values using Dffits values. Load the sample data and define the response and independent variables.

Test for autocorrelation among the residuals of a linear regression model.

Fit a generalized linear model and analyze the results. A typical workflow involves the following: import data, fit a generalized linear model, test its quality, modify it to improve the quality, and share it.

Determine the observations that have large influence on coefficients using Dfbetas. Load the sample data and define the response and independent variables.

Compute coefficient confidence intervals.

Regularize binomial regression. The default (canonical) link function for binomial regression is the logistic function.

Compute Leverage values and assess high leverage observations. Load the sample data and define the response and independent variables.

Assess the model assumptions by examining the residuals of a fitted linear regression model.

Assess the fit of the model and the significance of the regression coefficients using the F-statistic.

There are diagnostic plots to help you examine the quality of a model. plotDiagnostics(mdl) gives a variety of plots, including leverage and Cook's distance plots. plotResiduals(mdl) plots the model residuals.

Use the methods predict, feval, and random to predict and simulate responses to new data.

Principal Component Analysis (PCA) and Partial Least Squares (PLS) are widely used tools. This code shows their relationship through the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm.

Linear Mixed-Effects (LME) models are generalizations of linear regression models for data that is collected and summarized in groups. Linear mixed-effects models offer a flexible framework for modeling grouped data.

In this demo, we perform statistical analysis on automotive fuel economy data provided by the United States Environmental Protection Agency, using the Statistics Toolbox™.

This script demonstrates the use of constrained polynomials to compute a constrained approximation, that is, an approximation which fulfills predefined constraints. Such problems occur frequently in practice.

Consider the following data from Stroup (1989a), which arise from a balanced split-plot design with the whole plots arranged in a randomized complete-block design. The variable A is the whole-plot factor, and the variable B is the split-plot factor.

Human activity sensor data contains observations derived from sensor measurements taken from smartphones worn by people while doing different activities (walking, lying, sitting, etc.).

Predicting rare events in complex technical systems has become a critical issue for many industrial and commercial fields, due to the huge increase in sensors and the rapid growth of data.

A demonstration of the dot product and orthogonality that also includes some vector addition. Information from this tutorial is used in the QR decomposition and multiple regression tutorials.

The Linstats package provides a uniform mechanism for building any supported linear model. Once built, the same model can be analyzed in many ways, including least-squares regression and fit diagnostics.

Construct a map of 10 US cities based on the distances between those cities, using cmdscale.

This demo showcases visualization and analysis for forecasting energy demand based on historical data, with access to hour-by-hour utility usage for the year 2006.

Examples A and B make it clear that if we are trying to view uniform data over the hypercube, most (spherical) neighborhoods will be empty! Let us examine what happens for other data distributions.

Consider the unit hypercube in d dimensions and an inscribed hypersphere with radius r = 1/2. Then the fraction of the volume of the cube contained in the hypersphere is given by pi^(d/2) / (2^d * Gamma(d/2 + 1)), which shrinks rapidly toward zero as d grows.

The dynamic response of a 100 m high clamped-free steel beam is studied. Simulated time series are used, where the first three eigenmodes have been taken into account.

The time series from a single-degree-of-freedom (SDOF) system is computed using the central difference method, and white noise is used as the input force.

This case study analyzes the amount of vibration a passenger experiences for a vehicle traveling over a road disturbance (bump). We want to determine the amount of reduction in displacement

This HighChart object enables easy use of the JavaScript technology provided by http://www.highcharts.com/ to generate interactive and dynamic charts in the MATLAB web browser.

Hypothesis testing based on a model that is invalid can lead to faulty conclusions. This tutorial goes over a few basic diagnostic procedures that can be used to test whether a model is valid.

This tutorial will go over some of the functions available for making inferences and testing hypotheses. I assume that you know how to construct a model using encode.

Among the many statistical anomaly detection techniques, Hotelling's T-square method, a multivariate statistical analysis technique, has been one of the most widely used.

This is an enhanced version of the regstats function (Statistics Toolbox). It implements several ways to estimate robust standard errors (SE) for the coefficients.

This file describes the development of a failure boundary identification algorithm shown in the "Using Statistics and Optimization to Support Design Activities" webinar, July 21, 2009.

This tutorial describes multivariate Gaussians as it walks through the major functionality of the mmvn toolkit.

In this script, I reproduce the results presented by John D. Holmes in the first part of chapter 2 of his book Wind Loading of Structures [1]. The notations he uses are slightly different from those used here.

From: "Using Statistics and Optimization to Support Design Activities" Webinar, July 21, 2009.

In order to illustrate some of the numerical calculations required for testing hypotheses on canonical variance components based on the LRT statistic (37), as presented in Example 3, let us work through an example.
