Load the Fisher iris sample data. The data contains length and width measurements from the sepals and petals of three species of iris flowers. Store the petal length data for the versicolor
Use quantile-quantile (Q-Q) plots to determine whether two samples come from the same distribution family. Q-Q plots are scatter plots of quantiles computed from each sample, with a line
Use anovan to fit models where a factor's levels represent a random selection from a larger (infinite) set of possible levels.
Perform N-way ANOVA on car data with mileage and other information on 406 cars made between 1970 and 1982.
Perform one-way ANOVA to determine whether data from several groups have a common mean.
Perform two-way ANOVA to determine the effect of car model and factory on the mileage rating of cars.
Perform statistical analysis and machine learning on out-of-memory data with MATLAB® and Statistics and Machine Learning Toolbox™.
Use logistic regression and other techniques to perform data analysis on tall arrays. Tall arrays represent data that is too large to fit into computer memory.
Generate a nonlinear classifier with Gaussian kernel function. First, generate one class of points inside the unit disk in two dimensions, and another class of points in the annulus from
Perform linear and quadratic classification of Fisher iris data.
Create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.
Use a random subspace ensemble to increase the accuracy of classification. This example also shows how to use cross-validation to determine good parameters for both the weak learner template and the
You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is
When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different
Obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics:
The RobustBoost algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost parameters can produce an ensemble that does
Make a more robust and simpler model by trying to remove predictors without hurting the predictive power of the model. This is especially important when you have many predictors in your data.
Predict posterior probabilities of SVM models over a grid of observations, and then plot the posterior probabilities over the grid. Plotting posterior probabilities exposes decision
Determine which quadrant of an image a shape occupies by training an error-correcting output codes (ECOC) model composed of linear SVM binary learners. This example also illustrates the
Train a basic discriminant analysis classifier to classify irises in Fisher's iris data.
Train an ensemble of classification trees using data containing predictors with many categorical levels.
Perform classification using discriminant analysis, naive Bayes classifiers, and decision trees. Suppose you have a data set containing observations with measurements on different
Use a custom kernel function, such as the sigmoid kernel, to train SVM classifiers, and adjust custom kernel function parameters.
Train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data
Optimize an SVM classification using the bayesopt function. The classification works on locations of points from a Gaussian mixture model. In The Elements of Statistical Learning,
Optimize an SVM classification using the fitcsvm function and the OptimizeHyperparameters name-value pair. The classification works on locations of points from a Gaussian mixture model. In
Visualize posterior classification probabilities predicted by a naive Bayes classification model.
Perform five-fold cross-validation of a quadratic discriminant analysis classifier.
Plot the decision surface of different classification algorithms.
Find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance. This distance metric is used in correspondence analysis,
Predict classification for a k-nearest neighbor classifier.
Examine the quality of a k-nearest neighbor classifier using resubstitution and cross-validation.
Add a MATLAB Function block to a Simulink® model for label prediction. The MATLAB Function block accepts streaming data, and predicts the label and classification score using a trained, support
Generate C code from a MATLAB function that classifies images of digits using a trained classification model. This example demonstrates an alternative workflow to Digit Classification
Specify variable-size input arguments when you generate code for the object functions of classification and regression model objects. Variable-size data is data whose size might change
Generate code for the prediction of classification and regression model objects at the command line. You can also generate code using the MATLAB® Coder™ app. See Code Generation for
Generate C code from a MATLAB® System object™ that classifies images of digits using a trained classification model. This example also shows how to use the System object for classification
Use a Stateflow® chart for label prediction. The example trains a discriminant analysis model for the Fisher iris data set by using fitcdiscr, and defines a function for code generation that
Generate C/C++ code for the prediction of classification and regression model objects by using the MATLAB® Coder™ app. You can also generate code at the command line using codegen. See Code
Several indexing and searching methods for categorical arrays.
Compute and compare measures of dispersion for sample data that contains one outlier.
Categorize numeric data into a categorical ordinal array using ordinal. This is useful for discretizing continuous data.
Select an observation or subset of observations from a dataset array.
Sort observations (rows) in a dataset array using the command line. You can also sort rows using the Variables editor.
Compute and compare measures of location for sample data that contains one outlier.
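As a rough illustration of why the choice of location measure matters when one outlier is present, here is a minimal Python sketch. It is not the MATLAB workflow the example refers to. The sample values and the one-value-per-side trimming rule are hypothetical, chosen only to make the effect visible.

```python
import statistics

# Hypothetical sample: five typical values plus one outlier (100.0).
sample = [2.1, 2.4, 2.5, 2.7, 3.0, 100.0]

# The arithmetic mean is pulled strongly toward the outlier.
mean = statistics.mean(sample)

# The median depends only on the middle of the sorted data.
median = statistics.median(sample)

# Trimmed mean (assumed rule): drop the smallest and largest value, then average.
trimmed = statistics.mean(sorted(sample)[1:-1])

print(f"mean    = {mean:.3f}")
print(f"median  = {median:.3f}")
print(f"trimmed = {trimmed:.3f}")
```

The median and trimmed mean stay near the bulk of the data, while the mean is dominated by the single extreme value.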
Compute summary statistics grouped by levels of a categorical variable. You can compute group summary statistics for a numeric array or a dataset array using grpstats.
Create a 3-by-3 matrix of sample data. Remove two data values by replacing them with NaN.
Create a dataset array from a numeric array existing in the MATLAB® workspace.
Change the labels for category levels in categorical arrays using setlabels. You also have the option to specify labels when creating a categorical array.
Reorder the category levels in nominal arrays using reorderlevels. By definition, nominal array categories have no natural ordering. However, you might want to change the order of levels
Merge categories in a nominal or ordinal array using mergelevels. This is useful for collapsing categories with few observations.
Create a dataset array from heterogeneous variables existing in the MATLAB® workspace.
Reorder the category levels in an ordinal array using reorderlevels.
Visualize multivariate data using various statistical plots. Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to
Use cmdscale to perform classical (metric) multidimensional scaling, also known as principal coordinates analysis.
Analyze if companies within the same sector experience similar week-to-week changes in stock price.
Use Procrustes analysis to compare two handwritten number threes. Visually and analytically explore the effects of forcing size and reflection changes.
Visualize dissimilarity data using nonclassical forms of multidimensional scaling (MDS).
Use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as
Perform classical multidimensional scaling using the cmdscale function in Statistics and Machine Learning Toolbox™. Classical multidimensional scaling, also known as Principal
Select features for classifying high-dimensional data. More specifically, this example shows how to perform sequential feature selection, which is one of the most popular feature selection
Perform factor analysis using Statistics and Machine Learning Toolbox™.
Tune the regularization parameter in fscnca using cross-validation. Tuning the regularization parameter helps to correctly detect the relevant features in the data.
Perform feature selection that is robust to outliers using a custom robust loss function in NCA.
Visualize the MNIST data, which consists of images of handwritten digits, using the tsne function. The images are 28-by-28 pixels in grayscale. Each image has an associated label from 0
Hypothesis testing is a common method of drawing inferences about a population based on statistical evidence from a sample.
Estimate and plot the cumulative hazard and survivor functions for different groups.
Find the empirical survivor functions and the parametric survivor functions using the Burr type XII distribution fit to data for two groups.
Construct a Cox proportional hazards model, and assess the significance of the predictor variables.
Compute and plot the pdf of a Poisson distribution with parameter lambda = 5.
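The Poisson computation can be sketched in plain Python. This is a hedged, standard-library analogue, not the toolbox function. The helper name `poisson_pmf`, the range 0 to 15, and the text-based bar plot are assumptions made for illustration. For a discrete distribution the "pdf" is the probability mass function, P(X = k) = exp(-lambda) * lambda^k / k!.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5
ks = range(0, 16)  # assumed plotting range; the tail beyond 15 is negligible
pmf = [poisson_pmf(k, lam) for k in ks]

# Crude text plot: one row per k, bar length proportional to the probability.
for k, p in zip(ks, pmf):
    print(f"{k:2d}  {p:.4f}  {'#' * round(p * 200)}")
```

With an integer lambda, the probabilities at k = lambda - 1 and k = lambda are equal, so the plot shows a flat top at 4 and 5.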
Use copulafit to calibrate copulas with data. To generate data Xsim with a distribution "just like" (in terms of marginal distributions and correlations) the distribution of data in the
Similar to the bootstrap is the jackknife, which uses resampling to estimate the bias of a sample statistic. Sometimes it is also used to estimate standard error of the sample statistic. The
Plot the pdf of a bivariate Student's t distribution. You can use this distribution for a higher number of dimensions as well, although visualization is not easy.
Compute and plot the pdf using four different values for the parameter r, the desired number of successes: .1, 1, 3, and 6. In each case, the probability of success p is .5.
As for all discrete distributions, the cdf is a step function. The plot shows the discrete uniform cdf for N = 10.
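To make the step-function shape concrete, the following Python sketch evaluates the discrete uniform cdf on 1..N at a few real-valued points. The helper name `unid_cdf` and the evaluation points are hypothetical; this is an illustration, not the toolbox function.

```python
import math

def unid_cdf(x, n):
    """CDF of the discrete uniform distribution on 1..n, evaluated at a real x."""
    # Count the support points <= x, clamp the count to [0, n], divide by n.
    return min(max(math.floor(x), 0), n) / n

n = 10
for x in [0.5, 1, 1.9, 2, 5.5, 10, 12]:
    print(f"F({x}) = {unid_cdf(x, n)}")
```

The value stays constant between integers and jumps by 1/N at each integer from 1 to N, which is the staircase seen in the plot.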
The bootstrap procedure involves choosing random samples with replacement from a data set and analyzing each sample the same way. Sampling with replacement means that each observation is
Compute the pdf of an F distribution with 5 numerator degrees of freedom and 3 denominator degrees of freedom.
Compute the pdf of a gamma distribution with parameters A = 100 and B = 10. For comparison, also compute the pdf of a normal distribution with parameters mu = 1000 and sigma = 100.
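The comparison works because, assuming A is the shape and B is the scale parameter, the gamma distribution has mean A*B = 1000 and standard deviation sqrt(A)*B = 100, which match the normal parameters. A minimal standard-library Python sketch, with hypothetical helper names and an assumed evaluation grid:

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and scale b (assumed parameterization)."""
    # Work in log space so that x**(a - 1) and gamma(a) do not overflow.
    log_pdf = (a - 1) * math.log(x) - x / b - a * math.log(b) - math.lgamma(a)
    return math.exp(log_pdf)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

a, b = 100, 10
mu, sigma = a * b, math.sqrt(a) * b  # matching mean and standard deviation

for x in range(700, 1301, 100):
    print(f"x={x:5d}  gamma={gamma_pdf(x, a, b):.6f}  normal={normal_pdf(x, mu, sigma):.6f}")
```

For a shape parameter this large the two densities are close near the mean, with the gamma density remaining slightly right-skewed.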
Compute the pdf of an exponential distribution with parameter mu = 2.
Suppose the income of a family of four in the United States follows a lognormal distribution with mu = log(20,000) and sigma = 1. Compute and plot the income density.
Compute the pdf for a Student's t distribution with parameter nu = 5, and for a standard normal distribution.
Compute the pdf of a chi-square distribution with 4 degrees of freedom.
Compute and plot the cdf of a hypergeometric distribution.
Since the bivariate normal distribution is defined on the plane, you can also compute cumulative probabilities over rectangular regions.
Generate examples of probability density functions for the three basic forms of the generalized extreme value distribution.
Suppose the probability of a five-year-old car battery not starting in cold weather is 0.03. What is the probability of the car starting for 25 consecutive days during a long cold snap?
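The arithmetic behind this question can be sketched as follows, under the assumption (not stated in the text) that starts on different days are independent. The car must start on each of the 25 days, so the probability is (1 - 0.03)^25. Variable names are illustrative.

```python
p_fail = 0.03        # probability the car does not start on a given day
days = 25            # length of the cold snap

# Assuming independent days, all 25 starts succeed with probability (1 - p)^n.
p_all_start = (1 - p_fail) ** days

print(f"P(start on all {days} days) = {p_all_start:.4f}")
```

Under that assumption, the chance of 25 consecutive successful starts is a little under one half, even though each individual start succeeds 97% of the time.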
Compute the pdf of three generalized Pareto distributions. The first has shape parameter k = -0.25, the second has k = 0, and the third has k = 1.
The lognrnd function simulates independent lognormal random variables. In the following example, the mvnrnd function generates n pairs of independent normal random variables, and then
In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric
Test for the significance of the regression coefficients using the t-statistic.
Display R-squared (coefficient of determination) and adjusted R-squared. Load the sample data and define the response and independent variables.
Fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.
Compute the covariance matrix and standard errors of the coefficients.
Use the CovRatio statistics to determine the influential points in data. Load the sample data and define the response and predictor variables.
This example uses a bagged ensemble so it can use all three methods of evaluating ensemble quality.
Identify and remove redundant predictors from a generalized linear model.
This example uses data for predicting the insurance risk of a car based on its many attributes.
View a classification or regression tree. There are two ways to view a tree: view(tree) returns a text description and view(tree,'mode','graph') returns a graphic description of the tree.
Determine the observations that are influential on the fitted response values using Dffits values. Load the sample data and define the response and independent variables.
Test for autocorrelation among the residuals of a linear regression model.
Fit a generalized linear model and analyze the results. A typical workflow involves the following: import data, fit a generalized linear model, test its quality, modify it to improve the
Determine the observations that have large influence on coefficients using Dfbetas. Load the sample data and define the response and independent variables.
Regularize binomial regression. The default (canonical) link function for binomial regression is the logistic function.
Compute Leverage values and assess high leverage observations. Load the sample data and define the response and independent variables.
Assess the model assumptions by examining the residuals of a fitted linear regression model.
Assess the fit of the model and the significance of the regression coefficients using the F-statistic.
There are diagnostic plots to help you examine the quality of a model. plotDiagnostics(mdl) gives a variety of plots, including leverage and Cook's distance plots. plotResiduals(mdl)
Use the methods predict, feval, and random to predict and simulate responses to new data.
Rare-event prediction in complex technical systems has been a very interesting and critical issue for many industrial and commercial fields due to the huge increase in sensors and rapid growth
Though Hotelling’s T-square method is applicable to many multi-dimensional data sets, this method has a fundamental assumption that the data follow a unimodal distribution. So, when the
Among many statistical anomaly detection techniques, Hotelling’s T-square method, a multivariate statistical analysis technique, has been one of the most typical methods. This method
The previous methods, Hotelling’s T-square method and Gaussian mixture model, use Gaussian distribution-based parametric models. However, in practical situations, sometimes data
Demonstration of the dot product and orthogonality; also includes some vector addition. Information from this tutorial is used in QR decomposition and the multiple regression approach
The time series from an SDOF is computed using the central difference method, and white noise is used as the input force.
This demo showcases visualization and analysis (heavy statistics) for forecasting energy usage based on historical data. We have access to hour-by-hour utility usage for the month of
Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).
Statins are the most common class of drugs used for treating hyperlipidemia. However, studies have shown that even at their maximum dosage of 80 mg, many patients do not reach LDL cholesterol
This tutorial describes multivariate Gaussians as it walks through the major functionality of the mmvn toolkit.
Consider the hypercube and an inscribed hypersphere with radius . Then the fraction of the volume of the cube contained in the hypersphere is given by:
The dynamic response of a 100 m high clamped-free steel beam is studied. Simulated time series are used, where the first three eigenmodes have been taken into account. More precisely, the
Linear Mixed-Effects (LME) Models are generalizations of linear regression models for data that is collected and summarized in groups. Linear Mixed-Effects models offer a flexible
In this demo, we will perform statistical analysis on automotive fuel economy data provided by the United States Environmental Protection Agency. We will see how the Statistics Toolbox™
This tutorial will go over some of the functions available for making inferences and testing hypotheses. I assume that you know how to construct a model using encode. If not, see the
The Linstats package provides a uniform mechanism for building any supported linear model. Once built, the same model can be analyzed in many ways including least-squares regression, fit and
In this script, I reproduce the results presented by John D. Holmes in the first part of chapter 2 of his book, Wind Loading of Structures. The notations he uses are slightly different in