Use anovan to fit models where a factor's levels represent a random selection from a larger (infinite) set of possible levels.
Perform N-way ANOVA on car data with mileage and other information on 406 cars made between 1970 and 1982.
Perform one-way ANOVA to determine whether data from several groups have a common mean.
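For instance, one-way ANOVA with anova1 tests the null hypothesis that all groups share a common mean. A minimal sketch using simulated groups (the data here are hypothetical, generated only for illustration):

```matlab
% One-way ANOVA on three simulated groups with different means
rng default
y = [randn(10,1); randn(10,1)+1; randn(10,1)+2];   % 30 observations
group = [ones(10,1); 2*ones(10,1); 3*ones(10,1)];  % group labels
p = anova1(y,group);   % small p rejects the hypothesis of a common mean
```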
Perform two-way ANOVA to determine the effect of car model and factory on the mileage rating of cars.
Perform statistical analysis and machine learning on out-of-memory data with MATLAB® and Statistics and Machine Learning Toolbox™.
Generate a nonlinear classifier with a Gaussian kernel function. First, generate one class of points inside the unit disk in two dimensions, and another class of points in the annulus from radius 1 to radius 2.
Perform linear and quadratic classification of Fisher iris data.
Create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.
Use a random subspace ensemble to increase the accuracy of classification. It also shows how to use cross validation to determine good parameters for both the weak learner template and the ensemble.
You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is categorical, with two levels that indicate good ('g') or bad ('b') radar returns.
When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance often differ depending on whether the trees use surrogate splits.
Obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics: they are self-terminating, and they produce ensembles in which many members have negligibly small weights, so you can safely remove those members.
The RobustBoost algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost parameters can produce an ensemble that does not predict well.
Make a simpler, more robust model by removing predictors without hurting the predictive power of the model. This is especially important when you have many predictors in your data.
Predict posterior probabilities of SVM models over a grid of observations, and then plot the posterior probabilities over the grid. Plotting posterior probabilities exposes decision boundaries.
Determine which quadrant of an image a shape occupies by training an error-correcting output codes (ECOC) model composed of linear SVM binary learners.
Train a basic discriminant analysis classifier to classify irises in Fisher's iris data.
Train an ensemble of classification trees using data containing predictors with many categorical levels.
Perform classification using discriminant analysis, naive Bayes classifiers, and decision trees. Suppose you have a data set containing observations with measurements on different variables (called predictors) and their known class labels.
Use a custom kernel function, such as the sigmoid kernel, to train SVM classifiers, and adjust custom kernel function parameters.
Train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease.
Tune the regularization parameter in fscnca using cross-validation. Tuning the regularization parameter helps to correctly detect the relevant features in the data.
Optimize an SVM classification using the bayesopt function. The classification works on locations of points from a Gaussian mixture model, as described in The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman).
Optimize an SVM classification using the fitcsvm function and the OptimizeHyperparameters name-value pair. The classification works on locations of points from a Gaussian mixture model, as described in The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman).
Visualize posterior classification probabilities predicted by a naive Bayes classification model.
Perform five-fold cross validation of a quadratic discriminant analysis classifier.
Plot the decision surface of different classification algorithms.
Find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance, a distance metric used in correspondence analysis.
Predict classification for a k-nearest neighbor classifier.
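A minimal sketch of training a k-nearest neighbor classifier and predicting the class of a new observation, using the fisheriris sample data set that ships with the toolbox (the query point is hypothetical):

```matlab
% Train a 5-nearest-neighbor classifier on Fisher's iris data
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);

% Predict the species of a new flower with the given four measurements
label = predict(mdl,[5.9 3.0 5.1 1.8]);
```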
Examine the quality of a k-nearest neighbor classifier using resubstitution and cross validation.
Add a MATLAB Function block to a Simulink® model for label prediction. The MATLAB Function block accepts streaming data, and predicts the label and classification score using a trained support vector machine (SVM) classification model.
Generate C code from a MATLAB function that classifies images of digits using a trained classification model. This example demonstrates an alternative workflow to the Digit Classification Using HOG Features example.
Generate code from a prediction function that has variable-sized input arguments. Specifically, the example creates a function that predicts labels based on a trained classification model.
Certain classification and regression model classes have a predict function that supports code generation. Because prediction requires a trained classification or regression model, first save the model by using saveLearnerForCoder, and then load it in your entry-point function by using loadLearnerForCoder.
Generate C code from a MATLAB® System object™ that classifies images of digits using a trained classification model. This example also shows how to use the System object for classification in Simulink®.
Use a Stateflow® chart for label prediction. The example trains a discriminant analysis model for the Fisher iris data set by using fitcdiscr, and defines a function for code generation that loads the trained model and predicts labels for new data.
Generate code for Statistics and Machine Learning Toolbox™ functions involving model objects by using the MATLAB® Coder™ app.
Explore several indexing and searching methods for categorical arrays.
Compute and compare measures of dispersion for sample data that contains one outlier.
Categorize numeric data into a categorical ordinal array using ordinal. This is useful for discretizing continuous data.
Select an observation or subset of observations from a dataset array.
Sort observations (rows) in a dataset array using the command line. You can also sort rows using the Variables editor.
Compute and compare measures of location for sample data that contains one outlier.
Compute summary statistics grouped by levels of a categorical variable. You can compute group summary statistics for a numeric array or a dataset array using grpstats.
Create a 3-by-3 matrix of sample data. Remove two data values by replacing them with NaN.
Create a dataset array from a numeric array existing in the MATLAB® workspace.
Change the labels for category levels in categorical arrays using setlabels. You also have the option to specify labels when creating a categorical array.
Reorder the category levels in nominal arrays using reorderlevels. By definition, nominal array categories have no natural ordering. However, you might want to change the order of levels for display or analysis purposes.
Merge categories in a nominal or ordinal array using mergelevels. This is useful for collapsing categories with few observations.
Create a dataset array from heterogeneous variables existing in the MATLAB® workspace.
Reorder the category levels in an ordinal array using reorderlevels.
Use cmdscale to perform classical (metric) multidimensional scaling, also known as principal coordinates analysis.
Analyze whether companies within the same sector experience similar week-to-week changes in stock price.
Use Procrustes analysis to compare two handwritten number threes. Visually and analytically explore the effects of forcing size and reflection changes.
Perform feature selection that is robust to outliers using a custom robust loss function in NCA.
Visualize the MNIST data, which consists of images of handwritten digits, using the tsne function. The images are 28-by-28 pixels in grayscale. Each image has an associated label from 0 through 9.
Estimate the cumulative distribution function (CDF) from data in a nonparametric or semiparametric way. It also illustrates the inversion method for generating random numbers from the estimated CDF.
Use some more advanced techniques with the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data. The techniques include fitting models to censored data.
Fit the generalized extreme value distribution using maximum likelihood estimation. The extreme value distribution is used to model the largest or smallest value from a group or block of data.
Explore the difference between fitting a curve to a set of points and fitting a probability distribution to a sample of data.
Use the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data.
Analyze lifetime data with censoring. In biological or medical applications, this is known as survival analysis, and the times may represent the survival time of an organism or the time until an event of interest occurs.
Fit tail data to the Generalized Pareto distribution by maximum likelihood estimation.
Estimate and plot the cumulative hazard and survivor functions for different groups.
Find the empirical survivor functions and the parametric survivor functions using the Burr type XII distribution fit to data for two groups.
Construct a Cox proportional hazards model, and assess the significance of the predictor variables.
Visualize dissimilarity data using non-classical forms of multidimensional scaling (MDS).
Use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as orthogonal regression or total least squares.
Visualize multivariate data using various statistical plots. Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to visualize with a two-dimensional scatter plot.
Perform "classical" multidimensional scaling using the cmdscale function in the Statistics and Machine Learning Toolbox™. Classical multidimensional scaling is also known as principal coordinates analysis.
Select features for classifying high-dimensional data. More specifically, it shows how to perform sequential feature selection, which is one of the most popular feature selection algorithms.
Compute and plot the pdf of a Poisson distribution with parameter lambda = 5.
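A minimal sketch of this computation with poisspdf (the plotting range 0:15 is an arbitrary choice for illustration):

```matlab
% pdf of a Poisson distribution with lambda = 5
x = 0:15;                 % evaluate over a range of counts
y = poisspdf(x,5);
bar(x,y)                  % discrete pdf, so a bar plot is natural
xlabel('Observation'); ylabel('Probability')
```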
Use copulafit to calibrate copulas with data, and then generate simulated data Xsim with a distribution "just like" (in terms of marginal distributions and correlations) the distribution of the original data.
Similar to the bootstrap is the jackknife, which uses resampling to estimate the bias of a sample statistic. Sometimes it is also used to estimate the standard error of the sample statistic. The jackknife is implemented by the jackknife function.
Plot the pdf of a bivariate Student's t distribution. You can use this distribution for a higher number of dimensions as well, although visualization is not easy.
Compute and plot the pdf using four different values for the parameter r, the desired number of successes: 0.1, 1, 3, and 6. In each case, the probability of success p is 0.5.
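A minimal sketch of the negative binomial pdfs described above, using nbinpdf (nbinpdf accepts noninteger r, which covers the r = 0.1 case):

```matlab
% Negative binomial pdfs for r = 0.1, 1, 3, 6 with p = 0.5
x = 0:10;
figure; hold on
for r = [0.1 1 3 6]
    plot(x,nbinpdf(x,r,0.5),'-o')
end
legend('r = 0.1','r = 1','r = 3','r = 6')
hold off
```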
As for all discrete distributions, the cdf is a step function. The plot shows the discrete uniform cdf for N = 10.
The bootstrap procedure involves choosing random samples with replacement from a data set and analyzing each sample the same way. Sampling with replacement means that each observation can appear more than once in a given sample.
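The procedure above can be sketched with bootstrp; here the statistic is the sample mean, and the data are simulated purely for illustration:

```matlab
% Bootstrap estimate of the standard error of the sample mean
rng default
data = randn(50,1);                   % hypothetical sample
bootstat = bootstrp(100,@mean,data);  % mean of each of 100 resamples
se = std(bootstat);                   % spread of the resampled means
```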
Compute the pdf of an F distribution with 5 numerator degrees of freedom and 3 denominator degrees of freedom.
Compute the pdf of a gamma distribution with parameters A = 100 and B = 10. For comparison, also compute the pdf of a normal distribution with parameters mu = 1000 and sigma = 100.
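A minimal sketch of this comparison; with A = 100 and B = 10 the gamma distribution has mean A*B = 1000 and standard deviation sqrt(A)*B = 100, matching the normal parameters:

```matlab
% Gamma(A=100, B=10) vs. Normal(mu=1000, sigma=100)
x = 700:1300;
plot(x,gampdf(x,100,10),'-', x,normpdf(x,1000,100),'--')
legend('Gamma(A = 100, B = 10)','Normal(\mu = 1000, \sigma = 100)')
```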
Compute the pdf of an exponential distribution with parameter mu = 2.
Suppose the income of a family of four in the United States follows a lognormal distribution with mu = log(20,000) and sigma = 1. Compute and plot the income density.
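A minimal sketch of this density with lognpdf (the plotting range is an arbitrary choice for illustration):

```matlab
% Lognormal income density with mu = log(20,000), sigma = 1
x = 0:1000:120000;
y = lognpdf(x,log(20000),1);
plot(x,y)
xlabel('Income'); ylabel('Density')
```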
Compute the pdf for a Student's t distribution with parameter nu = 5, and for a standard normal distribution.
Compute the pdf of a chi-square distribution with 4 degrees of freedom.
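A minimal sketch with chi2pdf (the evaluation grid is an arbitrary choice):

```matlab
% pdf of a chi-square distribution with 4 degrees of freedom
x = 0:0.1:15;
plot(x,chi2pdf(x,4))
xlabel('x'); ylabel('Density')
```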
Compute and plot the cdf of a hypergeometric distribution.
Since the bivariate normal distribution is defined on the plane, you can also compute cumulative probabilities over rectangular regions.
Generate examples of probability density functions for the three basic forms of the generalized extreme value distribution.
Suppose the probability of a five-year-old car battery not starting in cold weather is 0.03. What is the probability of the car starting for 25 consecutive days during a long cold snap?
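Assuming the days are independent, the answer is the probability of zero failures in 25 Bernoulli trials:

```matlab
% P(starts every day for 25 days), with P(failure on one day) = 0.03
p = (1 - 0.03)^25;      % direct computation, about 0.47
p2 = binopdf(0,25,0.03); % equivalently, zero failures in 25 trials
```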
Compute the pdf of three generalized Pareto distributions. The first has shape parameter k = -0.25, the second has k = 0, and the third has k = 1.
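A minimal sketch with gppdf; the scale parameter is set to 1 in each case (an assumption for illustration), so only the shape parameter k varies:

```matlab
% Generalized Pareto pdfs for three shape parameters, sigma = 1
x = 0:0.05:6;
plot(x,gppdf(x,-0.25,1), x,gppdf(x,0,1), x,gppdf(x,1,1))
legend('k = -0.25','k = 0','k = 1')
```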
The lognrnd function simulates independent lognormal random variables. In the following example, the mvnrnd function generates n pairs of independent normal random variables, which are then exponentiated to produce lognormal random variables.
In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric and the last 10 are categorical.
Test for the significance of the regression coefficients using the t-statistic.
Display R-squared (coefficient of determination) and adjusted R-squared. Load the sample data and define the response and independent variables.
Fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.
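The fitting step can be sketched with fitlm on the carsmall sample data set that ships with the toolbox (the formula is a hypothetical choice of predictors; fitlm omits rows with missing values automatically):

```matlab
% Fit MPG as a linear function of Weight and Horsepower
load carsmall
tbl = table(Weight,Horsepower,MPG);
mdl = fitlm(tbl,'MPG ~ Weight + Horsepower');

% One quick quality check: the coefficient of determination
disp(mdl.Rsquared.Ordinary)
```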
Compute the covariance matrix and standard errors of the coefficients.
Use the CovRatio statistics to determine the influential points in data. Load the sample data and define the response and predictor variables.
Use a bagged ensemble so that you can apply all three methods of evaluating ensemble quality.
Identify and remove redundant predictors from a generalized linear model.
Use data for predicting the insurance risk of a car based on its many attributes.
View a classification or regression tree. There are two ways to view a tree: view(tree) returns a text description and view(tree,'mode','graph') returns a graphic description of the tree.
Determine the observations that are influential on the fitted response values using Dffits values. Load the sample data and define the response and independent variables.
Test for autocorrelation among the residuals of a linear regression model.
Fit a generalized linear model and analyze the results. A typical workflow involves the following: import data, fit a generalized linear model, test its quality, modify it to improve the quality, and share it.
Determine the observations that have large influence on coefficients using Dfbetas. Load the sample data and define the response and independent variables.
Regularize binomial regression. The default (canonical) link function for binomial regression is the logistic function.
Compute leverage values and assess high-leverage observations. Load the sample data and define the response and independent variables.
Assess the model assumptions by examining the residuals of a fitted linear regression model.
Assess the fit of the model and the significance of the regression coefficients using the F-statistic.
There are diagnostic plots to help you examine the quality of a model. plotDiagnostics(mdl) gives a variety of plots, including leverage and Cook's distance plots. plotResiduals(mdl) offers several plots of the residuals.
Use the methods predict, feval, and random to predict and simulate responses to new data.
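A minimal sketch of the difference between these methods on a simple linear model fit to the carsmall sample data set (the new weights are hypothetical query points):

```matlab
% Predict vs. simulate responses from a fitted linear model
load carsmall
mdl = fitlm(Weight,MPG);

wnew  = (2000:500:4000)';    % new predictor values
ypred = predict(mdl,wnew);   % mean predicted response
ysim  = random(mdl,wnew);    % simulated response, including noise
% feval(mdl,wnew) gives the same fitted values as predict
```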