Statistics and Machine Learning Toolbox
Analyze and model data using statistics and machine learning
Statistics and Machine Learning Toolbox™ provides functions and apps to describe, analyze, and model data. You can use descriptive statistics and plots for exploratory data analysis, fit probability distributions to data, generate random numbers for Monte Carlo simulations, and perform hypothesis tests. Regression and classification algorithms let you draw inferences from data and build predictive models.
For multidimensional data analysis, Statistics and Machine Learning Toolbox provides feature selection, stepwise regression, principal component analysis (PCA), regularization, and other dimensionality reduction methods that let you identify variables or features that impact your model.
The toolbox provides supervised and unsupervised machine learning algorithms, including support vector machines (SVMs), boosted and bagged decision trees, k-nearest neighbor, k-means, k-medoids, hierarchical clustering, Gaussian mixture models, and hidden Markov models. Many of the statistics and machine learning algorithms can be used for computations on data sets that are too big to be stored in memory.
Exploratory Data Analysis
Explore data through statistical plotting with interactive graphics and descriptive statistics. Identify patterns and features with clustering.
Visually explore data using probability plots, box plots, histograms, quantile-quantile plots, and advanced plots for multivariate analysis, such as dendrograms, biplots, and Andrews plots.
Understand and describe potentially large sets of data quickly using a few highly relevant numbers.
Discover patterns by grouping data using k-means, k-medoids, DBSCAN, hierarchical clustering, Gaussian mixture models, and hidden Markov models.
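As a minimal sketch of the clustering workflow, the following groups synthetic 2-D points into three clusters with k-means; the data here is made up purely for illustration.

```matlab
% Generate three well-separated synthetic clusters
rng(1)                                   % reproducible random data
X = [randn(50,2); randn(50,2)+4; randn(50,2)-4];

[idx, C] = kmeans(X, 3);                 % idx: cluster index per row, C: 3x2 centroids
gscatter(X(:,1), X(:,2), idx)            % plot points colored by cluster assignment
```

The same `X` could be passed to `kmedoids`, `dbscan`, or `clusterdata` to compare grouping methods.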
Feature Extraction and Dimensionality Reduction
Transform raw data into features that are most suitable for machine learning. Iteratively explore and create new features, and select the ones that optimize performance.
Extract features from data using unsupervised learning techniques such as sparse filtering and reconstruction ICA. You can also use specialized techniques to extract features from images, signals, text, and numeric data.
Automatically identify the subset of features that provide the best predictive power in modeling the data. Feature selection methods include stepwise regression, sequential feature selection, regularization, and ensemble methods.
Feature Transformation and Dimensionality Reduction
Reduce dimensionality by transforming existing (non-categorical) features into new predictor variables, from which less descriptive features can be dropped. Feature transformation methods include PCA, factor analysis, and nonnegative matrix factorization.
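A sketch of PCA-based reduction on the built-in Fisher iris measurements, keeping just enough principal components to explain 95% of the variance:

```matlab
load fisheriris                          % meas is a 150x4 numeric matrix
[coeff, score, ~, ~, explained] = pca(meas);

k = find(cumsum(explained) >= 95, 1);    % smallest k whose components reach 95%
reduced = score(:, 1:k);                 % 150-by-k lower-dimensional representation
```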
Classification
Model a categorical response variable as a function of one or more predictors. Use a variety of parametric and nonparametric classification algorithms, including logistic regression, SVM, boosted and bagged decision trees, naive Bayes, discriminant analysis, and k-nearest neighbors.
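As one concrete example of the fit–validate–predict pattern these classifiers share, the following fits a classification tree to the iris data, estimates its error with 5-fold cross-validation, and classifies a new observation:

```matlab
load fisheriris
mdl   = fitctree(meas, species);         % decision tree classifier
cvmdl = crossval(mdl, 'KFold', 5);       % 5-fold cross-validated copy
err   = kfoldLoss(cvmdl);                % estimated misclassification rate

label = predict(mdl, [5.9 3.0 5.1 1.8]); % predicted species for one new flower
```

Swapping `fitctree` for `fitcknn`, `fitcnb`, or `fitcdiscr` changes only the training call; `crossval` and `predict` work the same way.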
Automated Model Optimization
Improve model performance by automatically tuning hyperparameters, selecting features, and addressing data set imbalances with cost matrices.
Regression and ANOVA
Model a continuous response variable as a function of one or more predictors, using linear and nonlinear regression, mixed-effects models, generalized linear models, and nonparametric regression. Assign variance to different sources using ANOVA.
Linear and Nonlinear Regression
Model the behavior of complex systems with multiple predictors or response variables by choosing from many linear and nonlinear regression algorithms. Fit multilevel or hierarchical, linear, nonlinear, and generalized linear mixed-effects models with nested and/or crossed random effects to perform longitudinal or panel analyses, repeated measures, and growth modeling.
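A minimal linear-regression sketch using the built-in carsmall data: fit fuel economy as a function of weight and horsepower, inspect the coefficient table, and predict for a hypothetical new car.

```matlab
load carsmall                            % Weight, Horsepower, MPG column vectors
tbl = table(Weight, Horsepower, MPG);

mdl = fitlm(tbl, 'MPG ~ Weight + Horsepower');
disp(mdl.Coefficients)                   % estimates, standard errors, t-stats, p-values

new  = table(3000, 150, 'VariableNames', {'Weight','Horsepower'});
yhat = predict(mdl, new);                % predicted MPG for the new observation
```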
Nonparametric Regression
Generate an accurate fit without specifying a model that describes the relationship between predictors and response. Nonparametric techniques include SVMs, random forests, Gaussian processes, and Gaussian kernels.
Analysis of Variance (ANOVA)
Assign sample variance to different sources and determine whether the variation arises within or among different population groups. Use one-way, two-way, multiway, multivariate, and nonparametric ANOVA, as well as analysis of covariance (ANOCOVA) and repeated measures analysis of variance (RANOVA).
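A one-way ANOVA sketch on three synthetic groups with shifted means; the data is fabricated for illustration.

```matlab
rng default
y = [randn(10,1); randn(10,1)+1; randn(10,1)+2];         % three groups of 10
g = [repmat({'A'},10,1); repmat({'B'},10,1); repmat({'C'},10,1)];

p = anova1(y, g, 'off');                 % p-value for H0: all group means equal
```

A small `p` indicates that at least one group mean differs; `anova2` and `anovan` extend the same idea to two-way and multiway designs.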
Probability Distributions and Hypothesis Tests
Fit distributions to data. Analyze whether sample-to-sample differences are significant or consistent with random data variation. Generate random numbers from various distributions.
Random Number Generation
Generate pseudorandom and quasi-random number streams from either a fitted or a constructed probability distribution.
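Both paths look the same in practice. This sketch draws samples from a constructed Weibull distribution, refits a distribution object from the sample, and draws again; the parameter values are arbitrary.

```matlab
pd  = makedist('Weibull', 'a', 2, 'b', 1.5);   % constructed distribution object
r   = random(pd, 1000, 1);                     % 1000 pseudorandom draws

pd2 = fitdist(r, 'Weibull');                   % distribution fitted from data
r2  = random(pd2, 500, 1);                     % draws from the fitted distribution
```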
Hypothesis Tests
Perform t-tests, distribution tests (chi-square, Jarque-Bera, Lilliefors, and Kolmogorov-Smirnov), and nonparametric tests for one, paired, or independent samples. Test for autocorrelation and randomness, and compare distributions (two-sample Kolmogorov-Smirnov).
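A sketch comparing two synthetic samples with a two-sample t-test and a two-sample Kolmogorov-Smirnov test:

```matlab
rng default
x1 = normrnd(0,   1, 100, 1);            % sample from N(0, 1)
x2 = normrnd(0.5, 1, 100, 1);            % sample from N(0.5, 1)

[h, p]   = ttest2(x1, x2);               % h = 1 rejects equal means at the 5% level
[hk, pk] = kstest2(x1, x2);              % compares the two empirical distributions
```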
Industrial Statistics
Statistically analyze effects and data trends. Apply industrial statistical techniques such as customized design of experiments and statistical process control.
Design of Experiments (DOE)
Define, analyze, and visualize a customized design of experiments (DOE). Create and test practical plans for how to manipulate data inputs in tandem to generate information about their effects on data outputs.
Statistical Process Control (SPC)
Monitor and improve products or processes by evaluating process variability. Create control charts, estimate process capability, and perform gage repeatability and reproducibility studies.
Reliability and Survival Analysis
Visualize and analyze time-to-failure data, with and without censoring, by performing Cox proportional hazards regression and fitting distributions. Compute empirical hazard, survivor, and cumulative distribution functions, as well as kernel density estimates.
Scale to Big Data and the Cloud
Apply statistical and machine learning techniques to out-of-memory data. Speed up statistical computations and machine learning model training on clusters and cloud instances.
Analyze Big Data with Tall Arrays
Use tall arrays and tables with many classification, regression, and clustering algorithms to train models, without changing your code, on data sets that do not fit in memory.
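A sketch of the tall-array workflow, assuming a hypothetical folder of CSV files with columns `X1`, `X2`, and `Y`; the file pattern and variable names are illustrative only.

```matlab
ds  = datastore('data/*.csv');           % out-of-memory data source (hypothetical path)
t   = tall(ds);                          % tall table backed by the datastore

mdl = fitlm(t, 'Y ~ X1 + X2');           % same fitlm call as for in-memory tables;
                                         % computation is deferred and chunked
```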
Speed up statistical computations and model training with parallelization.
Cloud and Distributed Computing
Use cloud instances to speed up statistical and machine learning computations. Perform the complete machine learning workflow in MATLAB Online™.
Deployment and Code Generation
Deploy statistical and machine learning models to embedded systems, accelerate computationally intensive calculations using generated C code, and integrate with enterprise systems.
Generate portable and readable C or C++ code for inference of classification and regression algorithms, descriptive statistics, and probability distributions using MATLAB Coder™. Accelerate verification and validation of your high-fidelity simulations using machine learning models through MATLAB Function blocks and System blocks.
Integrate with Applications and Enterprise Systems
Deploy statistical and machine learning models as standalone, MapReduce, or Spark™ applications; as web apps; or as Microsoft® Excel® add-ins using MATLAB Compiler™. Build C/C++ shared libraries, Microsoft .NET assemblies, Java® classes, and Python® packages using MATLAB Compiler SDK™.
Updating Deployed Models
Update parameters of deployed models without regenerating the C/C++ prediction code.
Machine Learner Apps
Optimize hyperparameters in Classification Learner and Regression Learner, and specify misclassification costs in Classification Learner
Update a deployed decision tree or linear model without regenerating code, and generate C/C++ code for probability distribution functions (requires MATLAB Coder)
Generate fixed-point C/C++ code for the prediction of an SVM model (requires MATLAB Coder and Fixed-Point Designer™)
Perform spectral clustering using the spectralcluster function
Rank numeric and categorical features by their importance using a minimum redundancy maximum relevance (MRMR) algorithm and rank features for unsupervised learning using Laplacian scores