Key Features

  • Regression techniques, including linear, generalized linear, nonlinear, robust, regularized, ANOVA, repeated measures, and mixed-effects models
  • Big data algorithms for dimension reduction, descriptive statistics, k-means clustering, linear regression, logistic regression, and discriminant analysis
  • Univariate and multivariate probability distributions, random and quasi-random number generators, and Markov chain samplers
  • Hypothesis tests for distributions, dispersion, and location, and design of experiments (DOE) techniques for optimal, factorial, and response surface designs
  • Classification Learner app and algorithms for supervised machine learning, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, Naïve Bayes, discriminant analysis, and Gaussian process regression
  • Unsupervised machine learning algorithms, including k-means, k-medoids, hierarchical clustering, Gaussian mixtures, and hidden Markov models
  • Bayesian optimization for tuning machine learning algorithms by searching for optimal hyperparameters
Learn how machine learning tools in MATLAB® can be used to solve regression, clustering, and classification problems.
Perform statistical modeling and analysis using Statistics and Machine Learning Toolbox™.

Exploratory Data Analysis

Statistics and Machine Learning Toolbox™ provides multiple ways to explore data: statistical plotting with interactive graphics, algorithms for cluster analysis, and descriptive statistics for large data sets.


Statistical Plotting with Interactive Graphics

Statistics and Machine Learning Toolbox includes graphs and charts to visually explore your data. The toolbox augments MATLAB® plot types with probability plots, box plots, histograms, scatter histograms, 3D histograms, control charts, and quantile-quantile plots. The toolbox also includes specialized plots for multivariate analysis, including dendrograms, biplots, parallel coordinate charts, and Andrews plots.

How to visualize multivariate data using various statistical plots.
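
For instance, a minimal sketch using the carsmall sample data set that ships with the toolbox (Acceleration, Origin, and MPG are variables in that data set):

    load carsmall                          % loads MPG, Acceleration, Origin, and more
    figure
    boxplot(Acceleration, Origin)          % box plots grouped by country of origin
    xlabel('Country of Origin'); ylabel('Acceleration')
    figure
    probplot('normal', MPG(~isnan(MPG)))   % normal probability plot of fuel economy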

Descriptive Statistics

Descriptive statistics enable you to understand and describe potentially large sets of data quickly using a few highly relevant numbers. Statistics and Machine Learning Toolbox includes functions for calculating measures of central tendency (such as the mean and median), measures of dispersion (such as the standard deviation, range, and interquartile range), shape statistics (such as skewness and kurtosis), correlation coefficients, and percentiles and quantiles.
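
A minimal sketch of a few of these calculations on the carsmall sample data set:

    load carsmall
    mpg = MPG(~isnan(MPG));                % drop missing values first
    m  = mean(mpg);                        % central tendency
    s  = std(mpg);                         % dispersion
    sk = skewness(mpg);                    % shape
    q  = prctile(mpg, [25 50 75]);         % quartiles
    r  = corr(Weight(~isnan(MPG)), mpg);   % correlation between weight and mileage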

Box plot of car acceleration data grouped by country of origin.

Resampling Techniques

In some cases, performing inference on summary statistics using parametric methods is not possible. To deal with these cases, Statistics and Machine Learning Toolbox provides resampling techniques, including:

  • Random sampling from a data set with or without replacement
  • A nonparametric bootstrap function for investigating the distribution of statistics using resampling
  • A jackknife function for investigating the distribution of statistics using jackknife resampling
  • A bootci function for estimating confidence intervals using nonparametric bootstrap
Resampling LSAT score and law school GPAs to investigate correlation.
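
A minimal sketch of that workflow with bootstrp and bootci, using the lawdata sample data set (lsat and gpa for 15 law schools) that ships with the toolbox:

    load lawdata                           % lsat scores and gpa for 15 law schools
    rng default                            % for reproducibility
    rhohat   = corr(lsat, gpa);            % sample correlation
    bootstat = bootstrp(1000, @corr, lsat, gpa);   % bootstrap distribution of the correlation
    ci       = bootci(2000, @corr, lsat, gpa);     % bootstrap confidence interval
    histogram(bootstat)                    % visualize the sampling variability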

Dimensionality Reduction

Statistics and Machine Learning Toolbox provides algorithms and functions for reducing the dimensionality of your data sets. Dimensionality reduction is an important step in your data analysis because it can help improve model accuracy and performance, improve interpretability, and prevent overfitting. You can perform feature transformation and feature selection, and explore relationships between variables using visualization techniques, such as scatter plot matrices and classical multidimensional scaling.


Feature Transformation

Feature transformation (sometimes called feature extraction) is a dimensionality reduction technique that transforms existing features into new features (predictor variables) where less descriptive features can be dropped. Feature transformation methods available in Statistics and Machine Learning Toolbox include principal component analysis (PCA), factor analysis, and nonnegative matrix factorization.

Perform weighted principal component analysis and interpret the results.
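
A minimal sketch of feature transformation with pca, using the hald sample data set:

    load hald                              % ingredients: 13-by-4 matrix of correlated predictors
    [coeff, score, latent, ~, explained] = pca(ingredients);
    explained                              % percent of variance explained by each component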

Feature Selection

Feature selection is a dimensionality reduction technique that selects only the subset of measured features (predictor variables) that provide the best predictive power in modeling the data. It is useful when working with high-dimensional data or when collecting data for all features is cost prohibitive. Feature selection methods available in Statistics and Machine Learning Toolbox include:

  • Stepwise regression: Sequentially adds or removes features until there is no improvement in prediction accuracy; can be used with linear regression or generalized linear regression algorithms
  • Sequential feature selection: Similar to stepwise regression and can be used with any supervised learning algorithm and a custom performance measure
  • Boosted and bagged decision trees: Ensemble methods that compute variable importance from out-of-bag estimates
  • Regularization (lasso and elastic nets): Uses shrinkage estimators to remove redundant features by reducing their weights (coefficients) to zero
Select important features for cancer detection.
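
A minimal sketch of feature selection via lasso regularization on the hald sample data set; picking the sparsest model within one standard error of the minimum cross-validated error is one common heuristic, not the only option:

    load hald                              % heat: response; ingredients: correlated predictors
    rng default                            % cross-validation partitions are random
    [B, FitInfo] = lasso(ingredients, heat, 'CV', 10);   % 10-fold cross-validated lasso
    idx = FitInfo.Index1SE;                % sparsest fit within one standard error of the minimum
    selected = find(B(:, idx) ~= 0)        % indices of the retained predictors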

Multivariate Visualization

Statistics and Machine Learning Toolbox provides graphs and charts to explore multivariate data visually, including:

  • Scatter plot matrices
  • Dendrograms
  • Biplots
  • Parallel coordinate charts
  • Andrews plots
  • Glyph plots
Group scatter plot matrix showing how model year impacts different variables for autos.
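
A minimal sketch of a grouped scatter plot matrix with gplotmatrix, using carsmall:

    load carsmall
    X = [MPG, Horsepower, Weight];         % three variables to compare pairwise
    gplotmatrix(X, [], Model_Year)         % scatter plot matrix grouped by model year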

Machine Learning

Machine learning algorithms use computational methods to "learn" information directly from data without assuming a predetermined equation as a model. Statistics and Machine Learning Toolbox provides methods for performing supervised and unsupervised machine learning.

In this webinar, you will learn how to get started using machine learning tools to detect patterns and build predictive models from your data sets, and about several machine learning techniques available in MATLAB.

Classification

Classification algorithms enable you to model a categorical response variable as a function of one or more predictors. Statistics and Machine Learning Toolbox offers an app and functions that cover a variety of parametric and nonparametric classification algorithms, such as support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, Naïve Bayes, discriminant analysis, and logistic regression.

Learn how to find optimal parameters of a cross-validated SVM classifier using Bayesian optimization.
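
A minimal sketch of that workflow using fitcsvm with its built-in Bayesian hyperparameter optimization; restricting fisheriris to two species and two features is an illustrative choice:

    load fisheriris
    X = meas(51:end, 3:4);                 % versicolor vs. virginica, two predictors
    y = species(51:end);
    rng default                            % the optimization is stochastic
    mdl = fitcsvm(X, y, 'OptimizeHyperparameters', 'auto', ...
        'HyperparameterOptimizationOptions', struct('ShowPlots', false))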

Classification Learner App

The Classification Learner app lets you train models to classify data using supervised machine learning. You can use it to perform common tasks, such as:

  • Importing data and specifying cross-validation schemes
  • Exploring data and selecting features
  • Training models using several classification algorithms
  • Comparing and assessing models
  • Sharing trained models for use in applications such as computer vision and signal processing

Cluster Analysis

Statistics and Machine Learning Toolbox includes algorithms for performing cluster analysis to discover patterns in your data set by grouping data based on measures of similarity. Available algorithms include k-means, k-medoids, hierarchical clustering, Gaussian mixture models, and hidden Markov models. When the number of clusters is unknown, you can use cluster evaluation techniques to determine the number of clusters present in the data based on a specified metric.

Learn how to detect patterns in gene expression profiles by examining gene expression data.
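
A minimal sketch of cluster evaluation followed by k-means on the fisheriris sample data set; the silhouette criterion is one of several supported metrics:

    load fisheriris
    rng default                            % k-means initialization is random
    eva = evalclusters(meas, 'kmeans', 'silhouette', 'KList', 1:6);
    k   = eva.OptimalK                     % suggested number of clusters
    idx = kmeans(meas, k);                 % cluster assignment for each observation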

Nonparametric Regression

Statistics and Machine Learning Toolbox also supports nonparametric regression techniques for generating an accurate fit without specifying a model that describes the relationship between the predictor and the response. Nonparametric regression techniques can be more broadly classified under supervised machine learning for regression and include decision trees, boosted or bagged regression trees, and support vector machine regression.

Predict insurance risk by training an ensemble of regression trees using TreeBagger.
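
A minimal sketch of bagged regression trees with TreeBagger on carsmall; the predictor pair and the 100-tree ensemble size are illustrative:

    load carsmall
    X  = [Horsepower, Weight];
    ok = ~isnan(MPG) & ~any(isnan(X), 2);  % drop rows with missing values
    rng default
    mdl = TreeBagger(100, X(ok,:), MPG(ok), 'Method', 'regression', ...
        'OOBPrediction', 'on');
    plot(oobError(mdl))                    % out-of-bag error vs. number of grown trees
    yhat = predict(mdl, [150 3000])        % predicted MPG for a hypothetical car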

Regression and ANOVA


Regression

Using regression techniques, you can model a continuous response variable as a function of one or more predictors. Statistics and Machine Learning Toolbox offers a variety of regression algorithms, including linear regression, generalized linear models, nonlinear regression, and mixed-effects models.


Linear Regression

Linear regression is a statistical modeling technique used to describe a continuous response variable as a function of one or more predictor variables. It can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data. Statistics and Machine Learning Toolbox offers several types of linear regression models and fitting methods, including:

  • Simple: Model with only one predictor
  • Multiple: Model with multiple predictors
  • Multivariate: Model with multiple response variables
  • Robust: Model in the presence of outliers
  • Stepwise: Model with automatic variable selection
  • Regularized: Model that can deal with redundant predictors and prevent overfitting using ridge, lasso, and elastic net algorithms
In this webinar, you will learn how to use Statistics and Machine Learning Toolbox to generate accurate predictive models from data sets that contain large numbers of correlated variables.
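
A minimal sketch of multiple (and, as a variant, robust) linear regression with fitlm on carsmall:

    load carsmall
    tbl = table(Weight, Horsepower, MPG);
    mdl = fitlm(tbl, 'MPG ~ Weight + Horsepower')        % multiple linear regression
    % Robust variant, downweighting outliers:
    % mdlr = fitlm(tbl, 'MPG ~ Weight + Horsepower', 'RobustOpts', 'on');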

Nonlinear Regression

Nonlinear regression is a statistical modeling technique that helps describe nonlinear relationships in experimental data. Nonlinear regression models are generally assumed to be parametric, where the model is described as a nonlinear equation. Statistics and Machine Learning Toolbox also offers robust nonlinear fitting to deal with outliers in the data.

Examine a fitted nonlinear model using diagnostic, residual, and slice plots.
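
A minimal sketch of nonlinear regression with fitnlm on carsmall; the exponential model form and the starting values are illustrative assumptions, not a prescribed model:

    load carsmall
    modelfun = @(b, x) b(1) + b(2)*exp(-b(3)*x/1000);    % assumed exponential trend
    beta0 = [50 500 0.5];                  % illustrative starting values
    mdl = fitnlm(Weight, MPG, modelfun, beta0)           % rows with NaN are excluded
    plotResiduals(mdl)                     % diagnostic plot of the residuals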

Generalized Linear Models

Generalized linear models are a special case of nonlinear models that use linear methods. They allow response variables to have non-normal distributions, with a link function that describes how the expected value of the response is related to the linear predictors. Statistics and Machine Learning Toolbox supports fitting generalized linear models with the following response distributions:

  • Normal
  • Binomial (logistic or probit regression)
  • Poisson
  • Gamma
  • Inverse Gaussian
How to fit and evaluate generalized linear models using glmfit and glmval.
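
A minimal sketch of a generalized linear model fit with fitglm (a newer interface that complements glmfit), using fisheriris to build a binary response:

    load fisheriris
    X = meas(51:end, 1:2);                 % two species, two predictors
    y = strcmp(species(51:end), 'virginica');            % binary response
    mdl = fitglm(X, y, 'Distribution', 'binomial')       % logistic regression (logit link)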

Mixed-Effects Models

Linear and nonlinear mixed-effects models are generalizations of linear and nonlinear models for data that is collected and summarized in groups. These models describe the relationship between a response variable and independent variables, with coefficients that can vary with respect to one or more grouping variables. Statistics and Machine Learning Toolbox supports fitting multilevel or hierarchical linear, nonlinear, and generalized linear mixed-effects models with nested and/or crossed random effects, which you can use in a wide variety of studies.

Fit and evaluate mixed-effect models using nlmefit and nlmefitsa.
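
A minimal sketch of a linear mixed-effects fit with fitlme; treating model year in carsmall as a grouping variable with a random intercept is an illustrative assumption:

    load carsmall
    tbl = rmmissing(table(MPG, Weight, categorical(Model_Year), ...
        'VariableNames', {'MPG', 'Weight', 'Year'}));
    lme = fitlme(tbl, 'MPG ~ Weight + (1|Year)')         % random intercept per model year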

Model Assessment

Statistics and Machine Learning Toolbox enables you to perform model assessment for regression algorithms using tests for statistical significance and goodness-of-fit measures such as:

  • F-statistic and t-statistic
  • R² and adjusted R²
  • Cross-validated mean squared error
  • Akaike information criterion (AIC) and Bayesian information criterion (BIC)

You can calculate confidence intervals for both regression coefficients and predicted values.
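
A minimal sketch of these assessment quantities for a fitted linear model:

    load carsmall
    mdl = fitlm(Weight, MPG);              % simple linear regression
    mdl.Rsquared.Adjusted                  % adjusted R-squared
    mdl.ModelCriterion.AIC                 % Akaike information criterion
    coefCI(mdl)                            % 95% confidence intervals for the coefficients
    [ypred, yci] = predict(mdl, 3000)      % prediction and its confidence interval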


ANOVA

Analysis of variance (ANOVA) enables you to assign sample variance to different sources and determine whether the variation arises within or among different population groups. Statistics and Machine Learning Toolbox includes one-way, two-way, and N-way ANOVA, analysis of covariance (ANOCOVA), multivariate analysis of variance (MANOVA), repeated measures ANOVA, and nonparametric alternatives such as the Kruskal-Wallis and Friedman tests.

Perform N-way ANOVA on car data with mileage and other information on 406 cars made between 1970 and 1982.
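
A minimal sketch of a one-way ANOVA with anova1 on carsmall:

    load carsmall
    p = anova1(MPG, Origin)                % does mean MPG differ by country of origin?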

Probability Distributions

Statistics and Machine Learning Toolbox provides functions and an app to work with parametric and nonparametric probability distributions. With these tools, you can fit continuous and discrete distributions, use statistical plots to evaluate goodness-of-fit, compute probability density functions and cumulative distribution functions, and generate random and quasi-random numbers from probability distributions.

The toolbox lets you fit distributions to data, evaluate goodness of fit, compute probability functions, and generate random and quasi-random number streams for more than 40 different distributions.


Fitting Distributions to Data

The Distribution Fitting app enables you to fit data using predefined univariate probability distributions, a nonparametric (kernel-smoothing) estimator, or a custom distribution that you define. This app supports both complete data and censored (reliability) data. You can exclude data, save and load sessions, and generate MATLAB code. You can also estimate distribution parameters at the command line or construct probability distributions that correspond to the governing parameters.

Use the Distribution Fitting app to interactively fit a probability distribution to data.
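
A minimal sketch of command-line distribution fitting with fitdist on carsmall:

    load carsmall
    pd = fitdist(MPG, 'Normal')            % maximum likelihood fit; NaNs are ignored
    x = linspace(5, 50, 200);
    plot(x, pdf(pd, x))                    % evaluate and plot the fitted density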

Evaluating Goodness of Fit

Statistics and Machine Learning Toolbox provides statistical plots to evaluate how well a data set matches a specific distribution. The toolbox includes probability plots for a variety of standard distributions, including normal, exponential, extreme value, lognormal, Rayleigh, and Weibull. You can generate probability plots from complete data sets and censored data sets. Additionally, you can use quantile-quantile plots to evaluate how well a given distribution matches a standard normal distribution.

Statistics and Machine Learning Toolbox also provides hypothesis tests to determine whether a data set is consistent with different probability distributions. Specific distribution tests include:

  • Anderson-Darling tests
  • One-sided and two-sided Kolmogorov-Smirnov tests
  • Chi-square goodness-of-fit tests
  • Lilliefors tests
  • Ansari-Bradley tests
  • Jarque-Bera tests
  • Durbin-Watson tests
Perform maximum likelihood estimation on truncated, weighted, or bimodal data.
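
A minimal sketch of a few of these tests on simulated data:

    rng default
    x = randn(100, 1);                     % sample from a standard normal
    [h1, p1] = kstest(x)                   % Kolmogorov-Smirnov test vs. standard normal
    [h2, p2] = adtest(x)                   % Anderson-Darling test for normality
    [h3, p3] = lillietest(x)               % Lilliefors test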

Generating Random Numbers

The toolbox provides functions for generating pseudorandom and quasi-random number streams from probability distributions. You can generate random numbers from either a fitted or a constructed probability distribution by applying the random method. Statistics and Machine Learning Toolbox also provides functions for:

  • Generating random samples from multivariate distributions, such as multivariate t, multivariate normal, copulas, and Wishart
  • Sampling from finite populations
  • Performing Latin hypercube sampling
  • Generating samples from Pearson and Johnson systems of distributions

You can also generate quasi-random number streams, which produce highly uniform samples from the unit hypercube. Quasi-random sequences can often accelerate Monte Carlo simulations because fewer samples are required to achieve complete coverage.

Use copulas to generate data from multivariate distributions when there are complicated relationships among the variables, or when the individual variables are from different distributions.
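
A minimal sketch of random and quasi-random generation; the correlation values and the Halton skip are illustrative choices:

    rng default
    mu = [0 0];  Sigma = [1 0.8; 0.8 1];
    X = mvnrnd(mu, Sigma, 500);            % correlated bivariate normal sample
    U = copularnd('Gaussian', 0.7, 500);   % bivariate sample from a Gaussian copula
    p = haltonset(2, 'Skip', 1000);        % 2-D quasi-random (Halton) point set
    Q = net(p, 500);                       % 500 highly uniform points in the unit square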

Hypothesis Testing, DOE, and Statistical Process Control


Hypothesis Testing

Random variation can make it difficult to determine whether samples taken under different conditions are actually different. Hypothesis testing is an effective tool for analyzing whether sample-to-sample differences are significant and require further evaluation, or are consistent with random and expected data variation.

Statistics and Machine Learning Toolbox supports widely used parametric and nonparametric hypothesis testing procedures, including:

  • One-sample and two-sample t-tests
  • Nonparametric tests for one sample, paired samples, and two independent samples
  • Distribution tests (chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov)
  • Comparison of distributions (two-sample Kolmogorov-Smirnov)
  • Tests for autocorrelation and randomness
  • Linear hypothesis tests on regression coefficients
Calculate the sample size necessary for a hypothesis test.
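
A minimal sketch of a two-sample t-test on carsmall; comparing the USA and Japan groups is an illustrative choice:

    load carsmall
    origin = cellstr(Origin);              % char matrix to cell array of strings
    x = MPG(strcmp(origin, 'USA'));
    y = MPG(strcmp(origin, 'Japan'));
    [h, p, ci] = ttest2(x, y)              % h = 1 rejects equal means at the 5% level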

Design of Experiments (DOE)

You can use Statistics and Machine Learning Toolbox to define, analyze, and visualize a customized design of experiments (DOE). Functions for DOE enable you to create and test practical plans to gather data for statistical modeling. These plans show how to manipulate data inputs in tandem to generate information about their effects on data outputs. Supported design types include:

  • Full factorial
  • Fractional factorial
  • Response surface (central composite and Box-Behnken)
  • D-optimal
  • Latin hypercube

You can then estimate input effects and input interactions using ANOVA, linear regression, and response surface modeling, and visualize results through main effect plots, interaction plots, and multivariate charts.

Generate central composite designs and Box-Behnken designs.
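
A minimal sketch of generating several of the supported designs:

    dFF = fullfact([2 3 2]);               % full factorial with 2, 3, and 2 levels
    dCC = ccdesign(3);                     % central composite design for 3 factors
    dBB = bbdesign(3);                     % Box-Behnken design for 3 factors
    dLH = lhsdesign(20, 3);                % 20-run Latin hypercube sample in 3 dimensions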

Statistical Process Control

Statistics and Machine Learning Toolbox provides a set of functions that support statistical process control (SPC). These functions enable you to monitor and improve products or processes by evaluating process variability. With SPC functions, you can:

  • Perform gage repeatability and reproducibility studies
  • Estimate process capability
  • Create control charts
  • Apply Western Electric and Nelson control rules to control chart data
Visualize control limits of engine fan cooling process using control charts.
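
A minimal sketch of control charting with controlchart, using the parts sample data set:

    load parts                             % runout: 36 samples of 4 measured parts
    controlchart(runout, 'chart', {'xbar', 'r'});        % X-bar and R charts with limits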

Big Data, Parallel Computing, and Code Generation

Use MATLAB tools with Statistics and Machine Learning Toolbox to perform computationally demanding and data-intensive statistical analysis.


Big Data

You can use many of the toolbox functions with tall arrays and tall tables to apply statistics and machine learning functions on out-of-memory data that have an arbitrary number of rows. This enables you to use familiar MATLAB code to work with large data sets on local disks. You can also use MATLAB Compiler™ to deploy the same MATLAB code to operate in big data environments such as Hadoop®.

See the toolbox documentation for a complete list of supported functions.

Predict flight departure delay based on a number of variables.
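
A minimal sketch of out-of-memory analysis with tall arrays, using the airlinesmall.csv sample file that ships with MATLAB:

    ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
    ds.SelectedVariableNames = {'DepDelay', 'ArrDelay'};
    tt = tall(ds);                         % tall table: rows stream from disk on demand
    m = mean(tt.DepDelay, 'omitnan');      % deferred, out-of-memory computation
    gather(m)                              % trigger evaluation and collect the result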

Parallel Computing

You can use Statistics and Machine Learning Toolbox with Parallel Computing Toolbox™ to speed up statistical computations such as resampling, cross-validation, and training ensembles of decision trees.

See the toolbox documentation for a complete list of supported functions.

Run the regression of insurance risk ratings for car imports using TreeBagger in parallel.
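
A minimal sketch of parallel ensemble training; it assumes Parallel Computing Toolbox is available (without it, the same code runs serially):

    load carsmall
    X  = [Horsepower, Weight];
    ok = ~isnan(MPG) & ~any(isnan(X), 2);  % drop rows with missing values
    opts = statset('UseParallel', true);   % enable parallel computation
    mdl = TreeBagger(100, X(ok,:), MPG(ok), 'Method', 'regression', 'Options', opts);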

C Code Generation

You can use the toolbox with MATLAB Coder™ to generate portable and readable C code for select functions for classification, regression, clustering, descriptive statistics, and probability distributions. You can use the generated code to employ statistics and machine learning for:

  • Embedded systems development
  • Integration with other software
  • Accelerating computationally intensive MATLAB code
Generate C code for a MATLAB function to estimate position of a moving object based on past noisy measurements.
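
A minimal sketch of the workflow in recent releases; the saved file name svmModel and the entry-point name predictSpecies are hypothetical, and MATLAB Coder is required for the final step:

    % Train a classifier and save it in a form suitable for code generation.
    load fisheriris
    mdl = fitcsvm(meas(51:end,:), species(51:end));
    saveLearnerForCoder(compact(mdl), 'svmModel');

    % Entry-point function (in its own file, predictSpecies.m):
    %   function label = predictSpecies(X)  %#codegen
    %   mdl = loadLearnerForCoder('svmModel');
    %   label = predict(mdl, X);
    %   end
    %
    % Then generate C code with MATLAB Coder:
    %   codegen predictSpecies -args {zeros(1, 4)}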