Statistics and Machine Learning Toolbox
Statistics and Machine Learning Toolbox provides multiple ways to explore data: statistical plotting with interactive graphics, algorithms for cluster analysis, and descriptive statistics for large data sets.
Statistics and Machine Learning Toolbox includes graphs and charts to visually explore your data. The toolbox augments MATLAB® plot types with probability plots, box plots, histograms, scatter histograms, 3D histograms, control charts, and quantile-quantile plots. The toolbox also includes specialized plots for multivariate analysis, including dendrograms, biplots, parallel coordinate charts, and Andrews plots.
Visualizing Multivariate Data (Example)
How to visualize multivariate data using various statistical plots.
Modeling Data with the Generalized Extreme Value Distribution (Example)
How to fit the generalized extreme value distribution using maximum likelihood estimation.
Descriptive statistics enable you to understand and describe potentially large sets of data quickly. Statistics and Machine Learning Toolbox includes functions for calculating:
These functions help you summarize values in a data sample using a few highly relevant numbers.
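As a rough illustration of summarizing a sample with a few numbers, the sketch below computes measures of location, spread, and shape. The simulated data and variable names are assumptions made for the example, and the functions shown are only a sample of what the toolbox offers.

```matlab
% Simulated sample; the data and variable names are illustrative only.
rng(1);                          % fix the seed so the run is repeatable
x = [randn(200,1); 8];           % 200 standard normal draws plus one outlier

% Location
sampleMean   = mean(x);
sampleMedian = median(x);        % less sensitive to the outlier than the mean

% Spread
sampleStd = std(x);
sampleIqr = iqr(x);              % interquartile range

% Shape
sampleSkew = skewness(x);
sampleKurt = kurtosis(x);

% Selected percentiles (5th, 50th, 95th)
pcts = prctile(x, [5 50 95]);
```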
In some cases, performing inference on summary statistics using parametric methods is not possible. To deal with these cases, Statistics and Machine Learning Toolbox provides resampling techniques, including:
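One such resampling technique is the bootstrap, which this page mentions later alongside cross-validation. The sketch below, on simulated data with hypothetical variable names, estimates the standard error and a confidence interval for a sample median, a statistic for which a simple parametric formula is not readily available.

```matlab
% Simulated skewed sample; names and sizes are illustrative assumptions.
rng(2);
x = exprnd(3, 100, 1);                       % exponential sample with mean 3
nBoot = 2000;                                % number of bootstrap resamples

% Median of each bootstrap resample
bootMedians = bootstrp(nBoot, @median, x);
seMedian = std(bootMedians);                 % bootstrap standard error

% 95% bootstrap confidence interval for the median (bootci defaults)
ci = bootci(nBoot, @median, x);
```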
With regression, you can model a continuous response variable as a function of one or more predictors. Statistics and Machine Learning Toolbox offers a variety of regression algorithms, including linear regression, generalized linear models, nonlinear regression, and mixed-effects models.
Linear regression is a statistical modeling technique used to describe a continuous response variable as a function of one or more predictor variables. It can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data.
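A minimal sketch of fitting a linear regression with fitlm follows. The data are simulated and the variable names (x1, x2, y) are assumptions chosen for the example.

```matlab
% Simulated data: y depends linearly on two predictors plus noise.
rng(3);
n  = 100;
x1 = randn(n,1);
x2 = randn(n,1);
y  = 2 + 1.5*x1 - 0.5*x2 + 0.3*randn(n,1);
tbl = table(x1, x2, y);

% Fit the model using a Wilkinson-style formula
mdl = fitlm(tbl, 'y ~ x1 + x2');

coefEst  = mdl.Coefficients.Estimate;   % intercept, x1, x2
rSquared = mdl.Rsquared.Ordinary;       % coefficient of determination
```

Displaying mdl at the command line shows the coefficient table together with standard errors and p-values.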
The toolbox offers several types of linear regression models and fitting methods, including:
Computational Statistics: Feature Selection, Regularization, and Shrinkage with MATLAB
In this webinar, you will learn how to use Statistics and Machine Learning Toolbox to generate accurate predictive models from data sets that contain large numbers of correlated variables.
Nonlinear regression is a statistical modeling technique that helps describe nonlinear relationships in experimental data. Nonlinear regression models are generally assumed to be parametric, with the model described by a nonlinear equation. Machine learning methods are typically used for nonparametric nonlinear regression.
The toolbox also offers robust nonlinear fitting to deal with outliers in the data.
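The sketch below contrasts an ordinary and a robust nonlinear fit using nlinfit. The exponential decay model, the injected outlier, and the choice of the bisquare weight function are all assumptions made for illustration.

```matlab
% Simulated decay data with one gross outlier (illustrative only).
rng(4);
x = linspace(0, 5, 60)';
modelFun = @(b, x) b(1) * exp(-b(2) * x);    % hypothetical model
y = modelFun([10; 0.7], x) + 0.2*randn(size(x));
y(10) = y(10) + 8;                           % inject an outlier

beta0 = [5; 1];                              % starting values

% Ordinary nonlinear least squares
betaLS = nlinfit(x, y, modelFun, beta0);

% Robust fit: iteratively reweighted least squares with bisquare weights
opts = statset('nlinfit');
opts.RobustWgtFun = 'bisquare';
betaRobust = nlinfit(x, y, modelFun, beta0, opts);
```

Comparing betaLS with betaRobust gives a feel for how much a single outlier can pull the ordinary fit.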
Fitting with MATLAB: Statistics, Optimization, and Curve Fitting
In this webinar, you will learn applied curve fitting using MathWorks products. MathWorks engineers will present a series of techniques for solving real world challenges.
Generalized linear models are a special case of nonlinear models that use linear methods. They allow for the response variables to have nonnormal distributions and a link function that describes how the expected value of the response is related to the linear predictors.
Statistics and Machine Learning Toolbox supports fitting generalized linear models with the following response distributions:
Fitting Data with Generalized Linear Models (Example)
How to fit and evaluate generalized linear models using glmfit and glmval.
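A minimal sketch of the glmfit and glmval workflow, assuming simulated count data with a Poisson response and a log link (the data and coefficient values are illustrative):

```matlab
% Simulated counts whose mean grows exponentially with x.
rng(5);
x  = linspace(0, 2, 80)';
mu = exp(0.5 + 1.0*x);            % true mean on the response scale
y  = poissrnd(mu);

% Fit a Poisson GLM; glmfit adds the intercept term and uses the
% canonical (log) link for the Poisson distribution by default.
b = glmfit(x, y, 'poisson');

% Evaluate the fitted mean response; the link must match the fit.
yFit = glmval(b, x, 'log');
```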
Linear and nonlinear mixed-effects models are generalizations of linear and nonlinear models for data that is collected and summarized in groups. These models describe the relationship between a response variable and independent variables, with coefficients that can vary with respect to one or more grouping variables.
Statistics and Machine Learning Toolbox supports fitting multilevel or hierarchical, linear, nonlinear, and generalized linear mixed-effects models with nested and/or crossed random effects, which can be used to perform a variety of studies, including:
The plot illustrated below compares two ways to forecast gross domestic product by state. The first plot shows the forecast of a fixed-effects model using state dummy variables. The second plot shows the forecast of a mixed-effects model with a random intercept and slope. Solid circles indicate observations that were used to fit the model, and empty circles indicate observations that were excluded from the model.
This example uses the fitlme function in Statistics and Machine Learning Toolbox to fit the mixed-effects model that captures the overall trend. The mixed-effects model returns a much more confident forecast even in the presence of missing data.
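The GDP data behind that example is not reproduced here. As a stand-in, the sketch below fits a random intercept and slope model with fitlme to simulated grouped data; the table variables (y, x, g) and all numeric values are assumptions.

```matlab
% Simulated grouped data: each group has its own intercept and slope.
rng(6);
nGroups = 8;
nPer    = 10;
g  = repelem((1:nGroups)', nPer);            % group index for each row
x  = repmat((1:nPer)', nGroups, 1);          % predictor within each group
u0 = 2.0*randn(nGroups,1);                   % random intercept deviations
u1 = 0.5*randn(nGroups,1);                   % random slope deviations
y  = 10 + u0(g) + (1 + u1(g)).*x + 0.5*randn(size(x));

tbl = table(y, x, categorical(g), 'VariableNames', {'y','x','g'});

% Fixed intercept and slope, plus correlated random intercept and slope by group
lme = fitlme(tbl, 'y ~ 1 + x + (1 + x | g)');

fixedEst = fixedEffects(lme);    % overall intercept and slope
yHat     = fitted(lme);          % fitted values including random effects
```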
Multilevel Mixed-Effects Modeling Using MATLAB
This webinar describes how to fit a variety of linear mixed-effects models to make statistical inferences about data and to generate accurate predictions.
Statistics and Machine Learning Toolbox enables you to perform model assessment for regression algorithms using tests for statistical significance and goodness-of-fit measures such as:
You can calculate confidence intervals for both regression coefficients and predicted values.
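For instance, with a model returned by fitlm, coefCI and predict provide both kinds of interval. This is a sketch on simulated data; the variable names are hypothetical.

```matlab
% Simulated straight-line data.
rng(7);
x = (1:50)';
y = 3 + 0.8*x + 2*randn(50,1);

mdl = fitlm(x, y);

% 95% confidence intervals for the coefficients (one row per coefficient)
coefIntervals = coefCI(mdl);

% Predictions at new points with 95% intervals for the mean response
xNew = [55; 60];
[yPred, yCI] = predict(mdl, xNew);
```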
Statistics and Machine Learning Toolbox also supports nonparametric regression techniques for generating an accurate fit without specifying a model that describes the relationship between the predictor and the response. Nonparametric regression techniques can be more broadly classified under supervised machine learning for regression and include decision trees as well as boosted and bagged regression trees.
Develop a predictive model without specifying a function that describes the relationship between variables.
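A small sketch of this idea uses a regression tree fitted with fitrtree. The sine-shaped data and the leaf-size setting are assumptions chosen for illustration; no functional form is supplied to the fit.

```matlab
% Simulated nonlinear relationship; no model equation is given to the tree.
rng(8);
x = linspace(0, 2*pi, 200)';
y = sin(x) + 0.1*randn(size(x));

% Fit a regression tree; a larger minimum leaf size gives a smoother fit
tree = fitrtree(x, y, 'MinLeafSize', 10);

yTree    = predict(tree, x);     % piecewise-constant predictions
resubMSE = resubLoss(tree);      % mean squared error on the training data
```

The resubstitution error is optimistic; cross-validation gives a fairer estimate of predictive accuracy.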
Analysis of variance (ANOVA) enables you to assign sample variance to different sources and determine whether the variation arises within or among different population groups. Statistics and Machine Learning Toolbox includes these ANOVA algorithms and related techniques:
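As one example, a one-way ANOVA followed by pairwise comparisons might look like the sketch below. The three simulated groups and their means are assumptions.

```matlab
% Three simulated groups of 20 observations; the third has a shifted mean.
rng(9);
y     = [10 + randn(20,1); 10.5 + randn(20,1); 13 + randn(20,1)];
group = repelem((1:3)', 20);

% One-way ANOVA; 'off' suppresses the table and box plot figures
[p, tbl, stats] = anova1(y, group, 'off');

% Pairwise comparisons of group means (one row per pair of groups)
c = multcompare(stats, 'Display', 'off');
```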
Machine learning algorithms use computational methods to "learn" information directly from data without assuming a predetermined equation as a model. They can adaptively improve their performance as you increase the number of samples available for learning.
Machine Learning with MATLAB Overview
Learn how machine learning tools in MATLAB® can be used to solve regression, clustering, and classification problems.
Classification algorithms enable you to model a categorical response variable as a function of one or more predictors. Statistics and Machine Learning Toolbox offers an app and functions that cover a variety of parametric and nonparametric classification algorithms, such as:
An Introduction to Classification
Develop predictive models for classifying data.
You can evaluate goodness of fit for the resulting classification models using techniques such as:
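Cross-validation and a confusion matrix are two commonly used techniques of this kind. The sketch below applies both to a classification tree trained on the fisheriris sample data set that ships with the toolbox; the choice of model and of five folds is an assumption.

```matlab
load fisheriris                  % meas: 150x4 predictors, species: labels
rng(10);                         % make the fold assignment repeatable

mdl = fitctree(meas, species);   % classification tree

% 5-fold cross-validation
cvMdl   = crossval(mdl, 'KFold', 5);
cvError = kfoldLoss(cvMdl);      % estimated misclassification rate

% Out-of-fold predictions and the resulting confusion matrix
predicted = kfoldPredict(cvMdl);
confMat   = confusionmat(species, predicted);
```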
You can also statistically assess the predictive accuracies of two classification models to determine if they are different, or if one classification model performs better than another:
The Classification Learner app lets you train models to classify data using supervised machine learning. You can use it to perform common tasks, such as:
Statistics and Machine Learning Toolbox offers multiple algorithms to analyze data using k-means, k-medoids, hierarchical clustering, Gaussian mixture models, or hidden Markov models. When the number of clusters is unknown, you can use cluster evaluation techniques to determine the number of clusters present in the data based on a specified metric.
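The sketch below shows one way this could look: evalclusters picks the number of clusters using the silhouette criterion, and kmeans then partitions the data. The simulated blobs and the candidate range of 2 to 6 clusters are assumptions.

```matlab
% Three well-separated simulated groups in two dimensions.
rng(11);
X = [randn(100,2);
     randn(100,2) + 5;
     [randn(100,1) + 5, randn(100,1) - 5]];

% Evaluate k-means solutions for 2 to 6 clusters with the silhouette criterion
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:6);
k   = eva.OptimalK;

% Cluster with the suggested k; replicates reduce sensitivity to the start
idx = kmeans(X, k, 'Replicates', 5);
```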
Cluster Genes Using K-Means and Self-Organizing Maps (Example)
Learn how to detect patterns in gene expression profiles by examining gene expression data.
Cluster Analysis (Example)
Use k-means and hierarchical clustering to discover natural groupings in data.
Regression algorithms enable you to model a continuous response variable as a function of one or more predictors. Statistics and Machine Learning Toolbox offers a variety of parametric and nonparametric regression algorithms, such as:
Multivariate statistics provide algorithms and functions to analyze multiple variables. Typical applications include dimensionality reduction by feature transformation and feature selection, and exploring relationships between variables using visualization techniques, such as scatter plot matrices and classical multidimensional scaling.
Fitting an Orthogonal Regression Using Principal Components Analysis (Example)
Implement Deming regression (total least squares).
Feature transformation (sometimes called feature extraction) is a dimensionality reduction technique that transforms existing features into new features (predictor variables) where less descriptive features can be dropped. Statistics and Machine Learning Toolbox offers the following approaches for feature transformation:
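Principal component analysis, which the examples on this page refer to, is one such approach. The sketch below uses pca on simulated correlated data and keeps enough components to explain 95% of the variance; that threshold is an arbitrary assumption.

```matlab
% Two strongly correlated columns plus one independent column.
rng(12);
z = randn(200,1);
X = [z + 0.1*randn(200,1), 2*z + 0.1*randn(200,1), randn(200,1)];

% coeff: component loadings, score: data in component coordinates,
% explained: percentage of total variance explained by each component
[coeff, score, ~, ~, explained] = pca(X);

% Keep the leading components that together explain at least 95%
nKeep    = find(cumsum(explained) >= 95, 1);
Xreduced = score(:, 1:nKeep);
```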
Partial Least Squares Regression and Principal Components Regression (Example)
Model a response variable in the presence of highly correlated predictors.
Feature selection is a dimensionality reduction technique that selects only the subset of measured features (predictor variables) that provide the best predictive power in modeling the data. It is useful when you are dealing with high-dimensional data or when collecting data for all features is cost prohibitive.
Feature selection methods include the following:
You can use feature selection to:
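As an illustration, sequential feature selection with sequentialfs can be sketched as follows. The simulated data, the least-squares criterion function, and the five-fold partition are assumptions; only two of the six simulated predictors carry signal.

```matlab
% Six candidate predictors; only columns 1 and 4 affect the response.
rng(13);
n = 120;
X = randn(n, 6);
y = 3*X(:,1) - 2*X(:,4) + 0.5*randn(n,1);

% Criterion: sum of squared prediction errors on the held-out fold for a
% least-squares fit (no intercept, since the simulated data has none)
critFun = @(XT, yT, Xt, yt) sum((yt - Xt * (XT \ yT)).^2);

cvp      = cvpartition(n, 'KFold', 5);
selected = sequentialfs(critFun, X, y, 'cv', cvp);   % logical mask of columns
```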
Selecting Features for Classifying High-Dimensional Data (Example)
Select important features for cancer detection.
Statistics and Machine Learning Toolbox provides graphs and charts to explore multivariate data visually, including:
Statistics and Machine Learning Toolbox provides functions and an app to work with parametric and nonparametric probability distributions.
The toolbox lets you compute, fit, and generate samples from over 40 different distributions, including:
With these tools, you can:
The Distribution Fitting app enables you to fit data using predefined univariate probability distributions, a nonparametric (kernel-smoothing) estimator, or a custom distribution that you define. This app supports both complete data and censored (reliability) data. You can exclude data, save and load sessions, and generate MATLAB code.
You can estimate distribution parameters at the command line or construct probability distributions that correspond to the governing parameters.
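Both routes are sketched below: fitdist estimates parameters from data, and makedist constructs a distribution from given parameter values. The Weibull family and the specific parameter values are assumptions made for the example.

```matlab
% Simulated Weibull sample with scale 2 and shape 1.5.
rng(14);
x = wblrnd(2, 1.5, 500, 1);

% Estimate the parameters from the data
pdFit    = fitdist(x, 'Weibull');
estimate = pdFit.ParameterValues;    % estimated scale and shape
paramCI  = paramci(pdFit);           % confidence intervals, one column each

% Construct a distribution directly from known parameters
pdKnown  = makedist('Weibull', 'a', 2, 'b', 1.5);
tailProb = 1 - cdf(pdKnown, 4);      % probability of exceeding 4
```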
Additionally, you can create multivariate probability distributions, including Gaussian mixtures and multivariate normal, multivariate t, and Wishart distributions. You can use copulas to create multivariate distributions by joining arbitrary marginal distributions using correlation structures.
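A short sketch of the copula approach: draw dependent uniform values from a Gaussian copula with copularnd, then map each column through the inverse CDF of the desired marginal. The correlation value and the gamma and t marginals are assumptions.

```matlab
rng(15);
rho = [1 0.8; 0.8 1];                    % copula correlation matrix

% Dependent uniform(0,1) pairs from a Gaussian copula
U = copularnd('Gaussian', rho, 1000);

% Transform to arbitrary marginals via inverse CDFs
X = [gaminv(U(:,1), 2, 1), ...           % gamma marginal (shape 2, scale 1)
     tinv(U(:,2), 5)];                   % Student's t marginal (5 dof)

% Rank correlation is preserved by the monotone marginal transforms
rankCorr = corr(X(:,1), X(:,2), 'Type', 'Spearman');
```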
Simulating Dependent Random Variables Using Copulas (Example)
Create distributions that model correlated multivariate data.
With the toolbox, you can specify custom distributions and fit these distributions using maximum likelihood estimation.
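A minimal sketch with mle follows, assuming a custom density that is an exponential distribution shifted by a known offset of 1. The density, starting value, and bound are illustrative choices.

```matlab
% Simulated data: exponential with mean 2, shifted so that x >= 1.
rng(16);
x = exprnd(2, 300, 1) + 1;

% Custom pdf: exponential with mean mu, shifted by a known offset of 1
shiftedExpPdf = @(x, mu) exppdf(x - 1, mu);

% Maximum likelihood estimate of mu, constrained to be positive
muHat = mle(x, 'pdf', shiftedExpPdf, 'start', 1, 'lowerbound', 0);
```

For this particular density the estimate should land close to mean(x - 1), which offers a quick sanity check.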
Fitting Custom Univariate Distributions (Example)
Perform maximum likelihood estimation on truncated, weighted, or bimodal data.
Statistics and Machine Learning Toolbox provides statistical plots to evaluate how well a data set matches a specific distribution. The toolbox includes probability plots for a variety of standard distributions, including normal, exponential, extreme value, lognormal, Rayleigh, and Weibull. You can generate probability plots from complete data sets and censored data sets. Additionally, you can use quantile-quantile plots to evaluate how well a given distribution matches a standard normal distribution.
Statistics and Machine Learning Toolbox also provides hypothesis tests to determine whether a data set is consistent with different probability distributions. Specific distribution tests include:
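Two tests of this kind available in the toolbox are the Lilliefors test (lillietest) and the one-sample Kolmogorov-Smirnov test (kstest). The sketch below applies them to simulated data; the samples and the hypothesized parameters are assumptions.

```matlab
rng(17);
xNormal = 5 + 2*randn(300,1);        % sample from a normal distribution
xSkewed = exprnd(2, 300, 1);         % sample from an exponential distribution

% Lilliefors test: normality with unspecified mean and variance.
% h = 1 means normality is rejected at the 5% level.
[hNorm, pNorm] = lillietest(xNormal);
[hSkew, pSkew] = lillietest(xSkewed);

% Kolmogorov-Smirnov test against a fully specified distribution
pdHyp = makedist('Normal', 'mu', 5, 'sigma', 2);
[hKs, pKs] = kstest(xNormal, 'CDF', pdHyp);
```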
Statistics and Machine Learning Toolbox provides functions for analyzing probability distributions, including:
The toolbox provides functions for generating pseudorandom and quasirandom number streams from probability distributions. You can generate random numbers from either a fitted or a constructed probability distribution by applying the random method.
Statistics and Machine Learning Toolbox also provides functions for:
You can also generate quasirandom number streams. Quasirandom number streams produce highly uniform samples from the unit hypercube. Quasirandom number streams can often accelerate Monte Carlo simulations because fewer samples are required to achieve complete coverage.
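To make the contrast concrete, the sketch below estimates pi by Monte Carlo integration twice: once with pseudorandom points drawn through the random method of a distribution object, and once with a Sobol quasirandom point set. The sample size and the choice of a Sobol set are assumptions.

```matlab
nPts = 1024;

% Pseudorandom points in the unit square via the random method
rng(18);
pdUnif = makedist('Uniform');            % uniform on [0, 1] by default
Xp = random(pdUnif, nPts, 2);

% Quasirandom points: first nPts points of a 2-D Sobol set
sobolPts = sobolset(2);
Xq = net(sobolPts, nPts);

% Fraction of points inside the quarter circle estimates pi/4
piPseudo = 4 * mean(sum(Xp.^2, 2) <= 1);
piQuasi  = 4 * mean(sum(Xq.^2, 2) <= 1);
```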
MATLAB Coder™ lets you generate portable and readable C code for more than 100 Statistics and Machine Learning Toolbox functions, including probability distribution, descriptive statistics, and machine learning. You can use the generated code for:
You can use Statistics and Machine Learning Toolbox with Parallel Computing Toolbox™ to decrease computation time. Statistics and Machine Learning Toolbox has built-in parallel computing support for algorithms such as cross-validation and bootstrapping. This also lets you speed up Monte Carlo simulation or other statistical problems.
Built-in support for parallel computing in Statistics and Machine Learning Toolbox enables you to run statistical computations in parallel to gain speed and to reduce the execution time of your program or functions.
You can also speed up random number generation while preserving the statistical properties of the numbers generated without parallelization, so computations that use these random numbers remain completely reproducible.
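A hedged sketch of how these options are typically passed: a statset options structure carries the parallel and random stream settings to a function such as bootstrp. The flag useParallel is a hypothetical variable, left false here so the code runs without Parallel Computing Toolbox; the stream type and seed are assumptions.

```matlab
rng(19);
x = exprnd(3, 200, 1);               % simulated sample

% Set to true only if Parallel Computing Toolbox is available
useParallel = false;

% A stream type that supports substreams, for reproducible results
stream = RandStream('mlfg6331_64', 'Seed', 1);
opts = statset('UseParallel', useParallel, ...
               'Streams', stream, ...
               'UseSubstreams', true);

% Bootstrap the sample mean using the options above
bootMeans = bootstrp(500, @mean, x, 'Options', opts);
```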
Random variation can make it difficult to determine whether samples taken under different conditions are actually different. Hypothesis testing is an effective tool for analyzing whether sample-to-sample differences are significant and require further evaluation, or are consistent with random and expected data variation.
Statistics and Machine Learning Toolbox supports widely used parametric and nonparametric hypothesis testing procedures, including:
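For example, a parametric and a nonparametric comparison of two independent samples can be sketched as follows. The simulated samples and their means are assumptions.

```matlab
% Two simulated samples whose means differ by 6 units.
rng(20);
a = 50 + 5*randn(40,1);
b = 56 + 5*randn(40,1);

% Two-sample t-test: h = 1 rejects equal means at the 5% level;
% ci is a confidence interval for the difference in means
[h, p, ci] = ttest2(a, b);

% Wilcoxon rank sum test: nonparametric alternative
pRank = ranksum(a, b);
```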
Selecting a Sample Size (Example)
Calculate the sample size necessary for a hypothesis test.
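A brief sketch of such a calculation with sampsizepwr, assuming a one-sample t-test with a null mean of 100, a standard deviation of 5, and an alternative mean of 102 (all values are illustrative):

```matlab
% Sample size needed to detect a shift from 100 to 102 with 80% power
nRequired = sampsizepwr('t', [100 5], 102, 0.80);

% Power achieved for the same test with only 30 observations
powerAt30 = sampsizepwr('t', [100 5], 102, [], 30);
```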
Functions for design of experiments (DOE) enable you to create and test practical plans to gather data for statistical modeling. These plans show how to manipulate data inputs in tandem to generate information about their effects on data outputs. Supported design types include:
You can use Statistics and Machine Learning Toolbox to define, analyze, and visualize a customized DOE. For example, you can estimate input effects and input interactions using ANOVA, linear regression, and response surface modeling, and then visualize results through main effect plots, interaction plots, and multivari charts.
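A few design generators are sketched below. The factor counts and levels are assumptions, and these functions are only a sample of the design types the toolbox supports.

```matlab
% Full factorial design: factor 1 has 2 levels, factor 2 has 3 levels
design = fullfact([2 3]);        % one row per run, levels coded 1..k

% Two-level full factorial design for 3 factors, levels coded 0 and 1
coded = ff2n(3);

% Central composite design for 2 factors (for response surface models)
ccd = ccdesign(2);
```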
Statistics and Machine Learning Toolbox provides a set of functions that support statistical process control (SPC). These functions enable you to monitor and improve products or processes by evaluating process variability. With SPC functions, you can:
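As one illustration of evaluating process variability, the sketch below computes process capability indices with the capability function. The simulated measurements and the specification limits are assumptions.

```matlab
% Simulated measurements from a process centered at 10.
rng(21);
data  = 10 + 0.1*randn(100,1);
specs = [9.6 10.4];              % lower and upper specification limits

% Capability indices for the process
S   = capability(data, specs);
cp  = S.Cp;                      % potential capability
cpk = S.Cpk;                     % capability accounting for centering
```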