Statistics Toolbox

Key Features

  • Regression techniques, including linear, generalized linear, nonlinear, robust, regularized, ANOVA, and mixed-effects models
  • Repeated measures modeling for data with multiple measurements per subject
  • Univariate and multivariate probability distributions, including copulas and Gaussian mixtures
  • Random and quasi-random number generators and Markov chain samplers
  • Hypothesis tests for distributions, dispersion, and location, and design of experiments (DOE) techniques for optimal, factorial, and response surface designs
  • Supervised machine learning algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, Naïve Bayes, and discriminant analysis
  • Unsupervised machine learning algorithms, including k-means and hierarchical clustering, Gaussian mixtures, and hidden Markov models

Exploratory Data Analysis

Statistics Toolbox provides multiple ways to explore data: statistical plotting with interactive graphics, algorithms for cluster analysis, and descriptive statistics for large data sets.

Statistical Plotting and Interactive Graphics

Statistics Toolbox includes graphs and charts to visually explore your data. The toolbox augments MATLAB® plot types with probability plots, box plots, histograms, scatter histograms, 3D histograms, control charts, and quantile-quantile plots. The toolbox also includes specialized plots for multivariate analysis, including dendrograms, biplots, parallel coordinate charts, and Andrews plots.

Group scatter plot matrix showing interactions between five variables.

Visualizing Multivariate Data (Example)
How to visualize multivariate data using various statistical plots.

Compact box plot with whiskers for response grouped by year to look for potential year-specific fixed effects.
Scatter histogram using a combination of scatter plots and histograms to describe the relationship between variables.
Plot comparing the empirical CDF for a sample from an extreme value distribution with a plot of the CDF for the sampling distribution.

Modeling Data with the Generalized Extreme Value Distribution (Example)
How to fit the generalized extreme value distribution using maximum likelihood estimation.

Descriptive Statistics

Descriptive statistics enable you to understand and describe potentially large sets of data quickly. Statistics Toolbox includes functions for calculating descriptive statistics, which help you summarize values in a data sample using a few highly relevant numbers.

Resampling Techniques

In some cases, estimating summary statistics using parametric methods is not possible. To deal with these cases, Statistics Toolbox provides resampling techniques, including:

  • Random sampling from a dataset with or without replacement
  • bootstrp function for estimating sample statistics using generalized bootstrap resampling
  • jackknife function for estimating sample statistics using subsets of the data
  • bootci function for estimating confidence intervals
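
A brief sketch of these functions in use (bootstrp, jackknife, and bootci are the toolbox functions named above; the data here is synthetic):

```matlab
x = randn(100, 1);                     % synthetic sample for illustration

% Bootstrap estimates of the sample median (1000 resamples with replacement)
bootstat = bootstrp(1000, @median, x);

% Jackknife (leave-one-out) estimates of the same statistic
jackstat = jackknife(@median, x);

% 95% bootstrap confidence interval for the median
ci = bootci(1000, @median, x);
```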

Regression and ANOVA

Regression

With regression, you can model a continuous response variable as a function of one or more predictors. Statistics Toolbox offers a wide variety of regression algorithms, including linear regression, generalized linear models, nonlinear regression, and mixed-effects models.

Linear Regression

Linear regression is a statistical modeling technique used to describe a continuous response variable as a function of one or more predictor variables. It can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data.

The toolbox offers several types of linear regression models and fitting methods, including:

  • Simple: model with only one predictor
  • Multiple: model with multiple predictors
  • Multivariate: model with multiple response variables
  • Robust: model in the presence of outliers
  • Stepwise: model with automatic variable selection
  • Regularized: model that can deal with redundant predictors and prevent overfitting using ridge, lasso, and elastic net algorithms
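
For illustration, a minimal fitting sketch (fitlm and stepwiselm are the toolbox's linear and stepwise fitting functions in recent releases; exact syntax may vary by version):

```matlab
load carsmall                          % sample data set shipped with MATLAB
tbl = table(Weight, Horsepower, MPG);

% Multiple linear regression of MPG on Weight and Horsepower
mdl = fitlm(tbl, 'MPG ~ Weight + Horsepower');

% Robust fit that down-weights outliers
mdlRobust = fitlm(tbl, 'MPG ~ Weight + Horsepower', 'RobustOpts', 'on');

% Stepwise selection starting from a constant model
mdlStep = stepwiselm(tbl, 'MPG ~ 1', 'Upper', 'linear');
```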

Computational Statistics: Feature Selection, Regularization, and Shrinkage with MATLAB 36:51
In this webinar, you will learn how to use Statistics Toolbox to generate accurate predictive models from data sets that contain large numbers of correlated variables.

Nonlinear Regression

Nonlinear regression is a statistical modeling technique that helps describe nonlinear relationships in experimental data. Nonlinear regression models are generally assumed to be parametric, where the model is described by a nonlinear equation; machine learning methods are typically used for nonparametric nonlinear regression.

The toolbox also offers robust nonlinear fitting to deal with outliers in the data.

Fitting with MATLAB: Statistics, Optimization, and Curve Fitting 38:37
In this webinar, you will learn applied curve fitting using MathWorks products. MathWorks engineers will present a series of techniques for solving real world challenges.

Generalized Linear Models

Generalized linear models are a special case of nonlinear models that use linear methods. They allow the response variable to have a nonnormal distribution and use a link function that describes how the expected value of the response is related to the linear predictors.

Statistics Toolbox supports fitting generalized linear models with the following response distributions:

  • Normal
  • Binomial (logistic and probit regression)
  • Poisson
  • Gamma
  • Inverse Gaussian

Fitting Data with Generalized Linear Models (Example)
How to fit and evaluate generalized linear models using glmfit and glmval.
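
A short logistic-regression sketch using glmfit and glmval (synthetic stress-failure data for illustration):

```matlab
% x: stress levels; y: failures observed out of n trials at each level
x = [2100 2300 2500 2700 2900 3100]';
n = [48 42 31 34 31 21]';
y = [1 2 0 3 8 8]';

% Fit a binomial (logistic) generalized linear model
b = glmfit(x, [y n], 'binomial', 'link', 'logit');

% Evaluate fitted failure probabilities over a grid of stress levels
xx = linspace(2100, 3100)';
p = glmval(b, xx, 'logit');
```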

Mixed-Effects Models

Linear and nonlinear mixed-effects models are generalizations of linear and nonlinear models for data that is collected and summarized in groups. These models describe the relationship between a response variable and independent variables, with coefficients that can vary with respect to one or more grouping variables.

Statistics Toolbox supports fitting multilevel or hierarchical models with nested and/or crossed random effects, which can be used to perform a variety of studies, including:

  • Longitudinal analysis/panel analysis
  • Repeated measures modeling
  • Growth modeling
Plot comparing Gross State Product for three states fitted using a multilevel mixed-effects model (left) and ordinary least-squares (right). The fitlme function in Statistics Toolbox can create models with increased prediction accuracy when data is collected and summarized in groups.
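
A minimal fitlme sketch on synthetic grouped data (the formula follows the Wilkinson notation used by the toolbox):

```matlab
% Synthetic data: 5 groups with group-specific intercepts
g = repmat((1:5)', 20, 1);
x = randn(100, 1);
y = 2 + 1.5*x + 0.4*g + 0.5*randn(100, 1);
tbl = table(y, x, categorical(g), 'VariableNames', {'y', 'x', 'g'});

% Mixed-effects model: fixed effect for x, random intercept per group
lme = fitlme(tbl, 'y ~ x + (1 | g)');

% Compare against a fixed-effects-only model via a likelihood ratio test
lm = fitlme(tbl, 'y ~ x');
results = compare(lm, lme);
```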

Model Assessment

Statistics Toolbox enables you to perform model assessment for regression algorithms using tests for statistical significance and goodness-of-fit measures such as:

  • F-statistic and t-statistic
  • R2 and adjusted R2
  • Cross-validated mean squared error
  • Akaike information criterion (AIC) and Bayesian information criterion (BIC)

You can calculate confidence intervals for both regression coefficients and predicted values.

Nonparametric Regression

Statistics Toolbox also supports nonparametric regression techniques for generating an accurate fit without specifying a model that describes the relationship between the predictor and the response. Nonparametric regression techniques can be more broadly classified under supervised machine learning for regression and include decision trees as well as boosted and bagged regression trees.

Nonparametric Fitting 4:07
Develop a predictive model without specifying a function that describes the relationship between variables.

ANOVA

Analysis of variance (ANOVA) enables you to assign sample variance to different sources and determine whether the variation arises within or among different population groups. Statistics Toolbox includes ANOVA algorithms and related techniques such as one-way, two-way, and N-way ANOVA, analysis of covariance (ANOCOVA), and multivariate analysis of variance (MANOVA).
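
For illustration, a one-way ANOVA sketch on synthetic data (anova1 and multcompare are toolbox functions):

```matlab
% Three groups of observations (one column per group)
X = [randn(20,1)+5, randn(20,1)+5.5, randn(20,1)+7];

% One-way ANOVA: test whether the group means differ
[p, anovatab, stats] = anova1(X);

% Follow up with pairwise multiple comparisons
c = multcompare(stats);
```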

Machine Learning

Machine learning algorithms use computational methods to "learn" information directly from data without assuming a predetermined equation as a model. They can adaptively improve their performance as you increase the number of samples available for learning.

Machine Learning with MATLAB 3:02
Prepare data and train machine learning models with MATLAB®.

Classification

Classification algorithms enable you to model a categorical response variable as a function of one or more predictors. Statistics Toolbox offers a wide variety of parametric and nonparametric classification algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, Naïve Bayes, and discriminant analysis.

An Introduction to Classification 9:00
Develop predictive models for classifying data.
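
As one small example of the classifiers listed above, bagged decision trees via TreeBagger on the fisheriris sample data:

```matlab
load fisheriris                        % meas: 150x4 features; species: labels

% Ensemble of 50 bagged classification trees
model = TreeBagger(50, meas, species);

% Predict the species of a new observation
label = predict(model, [5.9 3.0 5.1 1.8]);
```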

You can evaluate goodness of fit for the resulting classification models using techniques such as cross-validation and confusion matrices.

Cluster Analysis

Statistics Toolbox offers multiple algorithms to analyze data using k-means, hierarchical clustering, Gaussian mixture models, or hidden Markov models. When the number of clusters is unknown, the toolbox offers cluster evaluation techniques to determine the number of clusters present in the data based on a specified metric.
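
A compact sketch combining k-means with cluster evaluation (evalclusters requires a recent release):

```matlab
load fisheriris
X = meas;                              % 150 observations, 4 features

% Partition into 3 clusters; replicates guard against poor local minima
[idx, C] = kmeans(X, 3, 'Replicates', 5);

% When the number of clusters is unknown, evaluate candidate counts
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 1:6);
k = eva.OptimalK;
```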

Plot showing natural patterns in gene expression profiles obtained from baker’s yeast. Principal component analysis (PCA) and k-means clustering algorithms are used to find clusters in the profile data.

Cluster Genes Using K-Means and Self-Organizing Maps (Example)
Learn how to detect patterns in gene expression profiles by examining gene expression data.

Two-component Gaussian mixture model fit to a mixture of bivariate Gaussians.
Output from applying a clustering algorithm to the same example.
Dendrogram plot showing a model with four clusters.

Cluster Analysis (Example)
Use k-means and hierarchical clustering to discover natural groupings in data.

Regression

Regression algorithms enable you to model a continuous response variable as a function of one or more predictors. Statistics Toolbox offers a wide variety of parametric and nonparametric regression algorithms.

Computational Statistics: Feature Selection, Regularization, and Shrinkage with MATLAB 36:51
In this webinar, you will learn how to use Statistics Toolbox to generate accurate predictive models from data sets that contain large numbers of correlated variables.

Multivariate Statistics

Multivariate statistics provide algorithms and functions to analyze multiple variables. Typical applications include dimensionality reduction by feature transformation and feature selection, and exploring relationships between variables using visualization techniques, such as scatter plot matrices and classical multidimensional scaling.

Fitting an Orthogonal Regression Using Principal Component Analysis (Example)
Implement Deming regression (total least squares).

Feature Transformation

Feature transformation (sometimes called feature extraction) is a dimensionality reduction technique that transforms existing features into new features (predictor variables) where less descriptive features can be dropped. The toolbox offers several approaches for feature transformation, including principal component analysis (PCA), nonnegative matrix factorization, and factor analysis.

Partial Least Squares Regression and Principal Component Regression (Example)
Model a response variable in the presence of highly correlated predictors.

Feature Selection

Feature selection is a dimensionality reduction technique that selects only the subset of measured features (predictor variables) that provide the best predictive power in modeling the data. It is useful when you are dealing with high-dimensional data or when collecting data for all features is cost prohibitive.

Feature selection methods include:

  • Stepwise regression sequentially adds or removes features until there is no improvement in prediction accuracy; it can be used with linear regression or generalized linear regression algorithms.
  • Sequential feature selection is similar to stepwise regression and can be used with any supervised learning algorithm and a custom performance measure.
  • Regularization (lasso and elastic nets) uses shrinkage estimators to remove redundant features by reducing their weights (coefficients) to zero.
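
A short lasso sketch on synthetic data with redundant predictors (lasso is the toolbox function named above):

```matlab
% Ten predictors, only two of which are informative
X = randn(100, 10);
y = X(:,1) - 2*X(:,3) + 0.5*randn(100, 1);

% Cross-validated lasso fits over a path of regularization strengths
[B, FitInfo] = lasso(X, y, 'CV', 10);

% Sparse coefficients at the one-standard-error lambda
bestB = B(:, FitInfo.Index1SE);
```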

Feature selection can be used to:

  • Improve the accuracy of a machine learning algorithm
  • Boost the performance on very high-dimensional data
  • Improve model interpretability
  • Prevent overfitting

Selecting Features for Classifying High-Dimensional Data (Example)
Select important features for cancer detection.

Multivariate Visualization

Statistics Toolbox provides graphs and charts to explore multivariate data visually, including:

  • Scatter plot matrices
  • Dendrograms
  • Biplots
  • Parallel coordinate charts
  • Andrews plots
  • Glyph plots
Group scatter plot matrix showing how model year impacts different variables.
Biplot showing the first three loadings from a principal component analysis.
Andrews plot showing how country of origin impacts the variables.

Probability Distributions

Statistics Toolbox provides functions and an app to work with parametric and nonparametric probability distributions.

The toolbox lets you compute, fit, and generate samples from more than 40 distributions, including normal, Weibull, gamma, and extreme value distributions. See the complete list of supported distributions.

With these tools, you can:

  • Fit distributions to data
  • Use statistical plots to evaluate goodness of fit
  • Compute key functions such as probability density functions and cumulative distribution functions
  • Generate random and quasi-random number streams from probability distributions
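
A minimal sketch of this workflow using fitdist (synthetic Weibull data; paramci is available on distribution objects in recent releases):

```matlab
x = wblrnd(2, 1.5, 1000, 1);           % synthetic Weibull sample

pd = fitdist(x, 'Weibull');            % maximum likelihood fit
ci = paramci(pd);                      % confidence intervals for the parameters

y = pdf(pd, linspace(0, 6));           % evaluate the fitted density
r = random(pd, 500, 1);                % draw new samples from the fitted model
```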

Fitting Distributions to Data

The Distribution Fitting app enables you to fit data using predefined univariate probability distributions, a nonparametric (kernel-smoothing) estimator, or a custom distribution that you define. This app supports both complete data and censored (reliability) data. You can exclude data, save and load sessions, and generate MATLAB code.

Visual plot of distribution data (left) and summary statistics (right). Using the Distribution Fitting app, you can estimate a normal distribution with mean and variance values (16.9 and 8.7, respectively, in this example).

You can estimate distribution parameters at the command line or construct probability distributions that correspond to the governing parameters.

Additionally, you can create multivariate probability distributions, including Gaussian mixtures and multivariate normal, multivariate t, and Wishart distributions. You can use copulas to create multivariate distributions by joining arbitrary marginal distributions using correlation structures.

Simulating Dependent Random Numbers Using Copulas (Example)
Create distributions that model correlated multivariate data.
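
As a sketch of the copula approach (joining arbitrary marginal distributions through a correlation structure):

```matlab
% Gaussian copula with correlation 0.7: dependent uniforms on the unit square
u = copularnd('Gaussian', 0.7, 1000);

% Transform each uniform margin through an arbitrary inverse CDF
x1 = gaminv(u(:,1), 2, 1);             % gamma marginal
x2 = tinv(u(:,2), 5);                  % t marginal
```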

With the toolbox, you can specify custom distributions and fit these distributions using maximum likelihood estimation.

Fitting Custom Univariate Distributions (Example)
Perform maximum likelihood estimation on truncated, weighted, or bimodal data.

Evaluating Goodness of Fit

Statistics Toolbox provides statistical plots to evaluate how well a data set matches a specific distribution. The toolbox includes probability plots for a variety of standard distributions, including normal, exponential, extreme value, lognormal, Rayleigh, and Weibull. You can generate probability plots from complete data sets and censored data sets. Additionally, you can use quantile-quantile plots to evaluate how well a given distribution matches a standard normal distribution.

Statistics Toolbox also provides hypothesis tests to determine whether a data set is consistent with different probability distributions. Specific tests include:

  • Chi-square goodness-of-fit tests
  • One-sided and two-sided Kolmogorov-Smirnov tests
  • Lilliefors tests
  • Ansari-Bradley tests
  • Jarque-Bera tests
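
For illustration, the listed tests applied to one sample (each function returns a rejection decision h and a p-value):

```matlab
x = randn(100, 1);                     % sample to test for normality

[h1, p1] = chi2gof(x);                 % chi-square goodness-of-fit
[h2, p2] = kstest(x);                  % Kolmogorov-Smirnov vs standard normal
[h3, p3] = lillietest(x);              % Lilliefors test
[h4, p4] = jbtest(x);                  % Jarque-Bera test
```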

Analyzing Probability Distributions

Statistics Toolbox provides functions for analyzing probability distributions, including:

  • Probability density functions
  • Cumulative distribution functions
  • Inverse cumulative distribution functions
  • Negative log-likelihood functions

Generating Random Numbers

Statistics Toolbox provides functions for generating pseudo-random and quasi-random number streams from probability distributions. You can generate random numbers from either a fitted or constructed probability distribution by applying the random method.

MATLAB code for constructing a Poisson distribution with a specific mean and generating a vector of random numbers that match the distribution.
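
The code described in the caption looks approximately like this (makedist requires a recent release; the older poissrnd function is an equivalent shortcut):

```matlab
% Construct a Poisson distribution with mean (lambda) 5
pd = makedist('Poisson', 'lambda', 5);

% Generate a 100-by-1 vector of random numbers from that distribution
r = random(pd, 100, 1);
```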

Statistics Toolbox also provides functions for:

  • Generating random samples from multivariate distributions, such as t, normal, copulas, and Wishart
  • Sampling from finite populations
  • Performing Latin hypercube sampling
  • Generating samples from Pearson and Johnson systems of distributions

You can also generate quasi-random number streams, which produce highly uniform samples from the unit hypercube. Quasi-random streams can often accelerate Monte Carlo simulations because fewer samples are required to achieve complete coverage.
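
A quasi-random sketch using the toolbox's haltonset (sobolset works analogously):

```matlab
% 2-D Halton point set, skipping early points and scrambling for quality
p = haltonset(2, 'Skip', 1000);
p = scramble(p, 'RR2');

% First 256 highly uniform points in the unit square
X = net(p, 256);
```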

Code Generation

MATLAB Coder lets you generate portable and readable C code for more than 100 Statistics Toolbox functions, including probability distributions and descriptive statistics. The generated code can be used for:

  • Standalone execution
  • Integration with other software
  • Accelerating statistics algorithms
  • Embedded implementation

Speed Up Statistical Computations Using Parallel Computing

Statistics Toolbox can be used with Parallel Computing Toolbox™ to decrease computation time. The toolbox provides built-in parallel computing support for algorithms such as cross-validation and bootstrapping, and lets you speed up Monte Carlo simulations and other statistical computations. Running these computations in parallel reduces the execution time of your programs and functions.

Reproducible Parallel Computations

You can speed up random number generation in parallel while maintaining the same statistical properties as random numbers generated without parallelization, making computations that use these random numbers completely reproducible.
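
One common pattern (a sketch; parfor needs Parallel Computing Toolbox to actually run in parallel) assigns an independent substream of the mrg32k3a generator to each iteration, so results do not depend on how iterations are scheduled across workers:

```matlab
r = zeros(4, 5);
parfor i = 1:4
    % Same seed everywhere; each iteration draws from its own substream
    t = RandStream('mrg32k3a', 'Seed', 1);
    t.Substream = i;
    r(i, :) = rand(t, 1, 5);
end
```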

Hypothesis Testing, Design of Experiments, and Statistical Process Control

Hypothesis Testing

Random variation can make it difficult to determine whether samples taken under different conditions are actually different. Hypothesis testing is an effective tool for analyzing whether sample-to-sample differences are significant and require further evaluation or are consistent with random and expected data variation.

Statistics Toolbox supports widely used parametric and nonparametric hypothesis testing procedures, including:

  • One-sample and two-sample t-tests
  • Nonparametric tests for one sample, paired samples, and two independent samples
  • Distribution tests (chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov)
  • Comparison of distributions (two-sample Kolmogorov-Smirnov)
  • Tests for autocorrelation and randomness
  • Linear hypothesis tests on regression coefficients
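
A two-sample sketch pairing a parametric and a nonparametric test (synthetic data):

```matlab
x = normrnd(10, 2, 50, 1);
y = normrnd(11, 2, 50, 1);

[h, p, ci] = ttest2(x, y);             % two-sample t-test for equal means
[p2, h2] = ranksum(x, y);              % Wilcoxon rank-sum (nonparametric)
```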

Selecting a Sample Size (Example)
Calculate the sample size necessary for a hypothesis test.
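
The sample-size calculation itself is a one-liner with the toolbox's sampsizepwr function:

```matlab
% Sample size for a one-sample t-test: detect a shift from mean 100
% to mean 102 (sigma = 5) with 80% power at the 5% significance level
n = sampsizepwr('t', [100 5], 102, 0.80);

% Conversely, the power achieved by a given sample size (here n = 50)
pwr = sampsizepwr('t', [100 5], 102, [], 50);
```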

Design of Experiments

Functions for design of experiments (DOE) enable you to create and test practical plans to gather data for statistical modeling. These plans show how to manipulate data inputs in tandem to generate information about their effect on data outputs. Supported design types include:

  • Full factorial
  • Fractional factorial
  • Response surface (central composite and Box-Behnken)
  • D-optimal
  • Latin hypercube
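
For illustration, generating a few of these designs:

```matlab
% Two-level full factorial design for 3 factors (8 runs, coded -1/+1)
dFF = 2*ff2n(3) - 1;                   % ff2n returns 0/1 levels

% Central composite (response surface) design for 3 factors
dCC = ccdesign(3, 'Type', 'circumscribed');

% 20-point Latin hypercube sample in 4 dimensions
dLH = lhsdesign(20, 4);
```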

You can use Statistics Toolbox to define, analyze, and visualize a customized DOE. For example, you can estimate input effects and input interactions using ANOVA, linear regression, and response surface modeling, then visualize results through main effect plots, interaction plots, and multivari charts.

Fitting a decision tree to data. The fitting capabilities in Statistics Toolbox enable you to visualize a decision tree by drawing a diagram of the decision rule and group assignments.
Model of a chemical reaction for an experiment using the design-of-experiments (DOE) and surface-fitting capabilities of Statistics Toolbox.
Model of a chemical reaction for an experiment using the design-of-experiments (DOE) and surface-fitting capabilities of Statistics Toolbox.

Statistical Process Control

Statistics Toolbox provides a set of functions that support statistical process control (SPC). These functions enable you to monitor and improve products or processes by evaluating process variability. With SPC functions, you can:

  • Perform gage repeatability and reproducibility studies
  • Estimate process capability
  • Create control charts
  • Apply Western Electric and Nelson control rules to control chart data
Control charts showing process data and violations of Western Electric control rules. Statistics Toolbox provides a variety of control charts and control rules for monitoring and evaluating products or processes.
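
A short SPC sketch (the parts sample data ships with the toolbox; the capability specification limits are illustrative):

```matlab
load parts                             % runout: measurements across parts

% X-bar and R control charts for the measurement data
st = controlchart(runout, 'charttype', {'xbar', 'r'});

% Process capability against lower/upper specification limits
data = normrnd(3, 0.005, 100, 1);      % simulated measurements
S = capability(data, [2.99 3.01]);     % Cp, Cpk, and related indices
```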
