Use anovan to fit models where a factor's levels represent a random selection from a larger (infinite) set of possible levels.
Perform N-way ANOVA on car data with mileage and other information on 406 cars made between 1970 and 1982.
Perform one-way ANOVA to determine whether data from several groups have a common mean.
Perform two-way ANOVA to determine the effect of car model and factory on the mileage rating of cars.
Generate a nonlinear classifier with Gaussian kernel function. First, generate one class of points inside the unit disk in two dimensions, and another class of points in the annulus from
Perform linear and quadratic classification of Fisher iris data.
Use a random subspace ensemble to increase the accuracy of classification. It also shows how to use cross validation to determine good parameters for both the weak learner template and the
You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is
When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different
Create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.
Demonstrates fitting a non-linear regression tree model to hourly day-ahead electricity prices in the New England pool region. The log electricity prices are modeled with two additive
Obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics:
The RobustBoost algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost parameters can produce an ensemble that does
Small, round blue-cell tumors (SRBCTs) belong to four distinct diagnostic categories. The categories have widely differing prognoses and treatment options, making it extremely
Predict posterior probabilities of SVM models over a grid of observations, and then plot the posterior probabilities over the grid. Plotting posterior probabilities exposes decision
Make a more robust and simpler model by trying to remove predictors without hurting the predictive power of the model. This is especially important when you have many predictors in your data.
Determine which quadrant of an image a shape occupies by training an error-correcting output codes (ECOC) model comprised of linear SVM binary learners. This example also illustrates the
Train a basic discriminant analysis classifier to classify irises in Fisher's iris data.
Train an ensemble of classification trees using data containing predictors with many categorical levels.
Perform classification using discriminant analysis, naive Bayes classifiers, and decision trees. Suppose you have a data set containing observations with measurements on different
Use a custom kernel function, such as the sigmoid kernel, to train SVM classifiers, and adjust custom kernel function parameters.
Train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data
Tune the regularization parameter in fscnca using cross-validation. Tuning the regularization parameter helps to correctly detect the relevant features in the data.
Optimize an SVM classification. The classification works on locations of points from a Gaussian mixture model. In The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman
Optimize an SVM classification using the fitcsvm function and OptimizeHyperparameters name-value pair. The classification works on locations of points from a Gaussian mixture model. In
Products used: Statistics and Machine Learning Toolbox™ and MATLAB® Coder™.
Find the indices of the three nearest observations in X to each observation in Y with respect to the chi-square distance. This distance metric is used in correspondence analysis,
Gaussian mixture models (GMM) are often used for data clustering. Usually, fitted GMMs cluster by assigning query data points to the multivariate normal components that maximize the
Predict classification for a k-nearest neighbor classifier.
Examine the quality of a k-nearest neighbor classifier using resubstitution and cross validation.
Add a MATLAB Function block to a Simulink® model for label prediction. The MATLAB Function block accepts streaming data, and predicts the label and classification score using a trained, support
Products used: Statistics and Machine Learning Toolbox™, MATLAB® Coder™, Simulink®, and Computer Vision Toolbox™.
Generate code from a prediction function that has variable-sized input arguments. Specifically, the example creates a function that predicts labels based on a trained classification
Use a Stateflow® chart for label prediction. The example trains a discriminant analysis model for the Fisher iris data set by using fitcdiscr, and declares a function for code generation
Several indexing and searching methods for categorical arrays.
Compute and compare measures of dispersion for sample data that contains one outlier.
Categorize numeric data into a categorical ordinal array using ordinal. This is useful for discretizing continuous data.
Select an observation or subset of observations from a dataset array.
Compute and compare measures of location for sample data that contains one outlier.
Compute summary statistics grouped by levels of a categorical variable. You can compute group summary statistics for a numeric array or a dataset array using grpstats.
Sort observations (rows) in a dataset array using the command line. You can also sort rows using the Variables editor.
Create a 3-by-3 matrix of sample data. Remove two data values by replacing them with NaN.
Create a dataset array from a numeric array existing in the MATLAB® workspace.
Change the labels for category levels in categorical arrays using setlabels. You also have the option to specify labels when creating a categorical array.
Merge categories in a categorical array using mergelevels. This is useful for collapsing categories with few observations.
Reorder the category levels in nominal arrays using reorderlevels. By definition, nominal array categories have no natural ordering. However, you might want to change the order of levels
Create a dataset array from heterogeneous variables existing in the MATLAB® workspace.
Use cmdscale to perform classical (metric) multidimensional scaling, also known as principal coordinates analysis.
Analyze if companies within the same sector experience similar week-to-week changes in stock price.
Use Procrustes analysis to compare two handwritten number threes. Visually and analytically explore the effects of forcing size and reflection changes.
Perform feature selection that is robust to outliers using a custom robust loss function in NCA.
Visualize the MNIST data, which consists of images of handwritten digits, using the tsne function. The images are 28-by-28 pixels in grayscale. Each image has an associated label from 0
Use rica to disentangle mixed audio signals. You can use rica to perform independent component analysis (ICA) when prewhitening is included as a preprocessing step. The ICA model is
This demo showcases visualization and analysis (heavy statistics) for forecasting energy usage based on historical data. We have access to hour-by-hour utility usage for the month of
The problem of Feature Selection is: Given a (usually large) number of noisy and partly redundant variables and a target that we would like to predict, choose a small but indicative subset as
Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b).
The CGDS toolbox provides a set of functions for retrieving data from the cBio Cancer Genomics Data Portal web API. Get started by adding the CGDS toolbox directory to the path and setting the
Estimate the cumulative distribution function (CDF) from data in a non-parametric or semi-parametric fashion. It also illustrates the inversion method for generating random numbers
Use some more advanced techniques with the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data. The techniques include fitting models to
Fit the generalized extreme value distribution using maximum likelihood estimation. The extreme value distribution is used to model the largest or smallest value from a group or block of
The difference between fitting a curve to a set of points, and fitting a probability distribution to a sample of data.
Use the Statistics and Machine Learning Toolbox™ function mle to fit custom distributions to univariate data.
Analyze lifetime data with censoring. In biological or medical applications, this is known as survival analysis, and the times may represent the survival time of an organism or the time
Fit tail data to the Generalized Pareto distribution by maximum likelihood estimation.
Compare multiple sample distributions using a statistic of the maximum difference in probability of cumulative distributions. An extension of the Kolmogorov-Smirnov test to compare
Use hypothesis testing to analyze gas prices measured across the state of Massachusetts during two separate months.
Demo file from August 7, 2007 webinar titled Data Analysis with Statistics Toolbox and Curve Fitting Toolbox. View the recorded webinar:
Estimate and plot the cumulative hazard and survivor functions for different groups.
This is a walkthrough of the Demo shown in the 30 November 2006 Webinar titled "Using Statistics for Uncertainty Analysis in System Models". The demo covers two basic topics:
Find the empirical survivor functions and the parametric survivor functions using the Burr type XII distribution fit to data for two groups.
This demo uses MATLAB, the Statistics Toolbox, the Curve Fitting Toolbox, and the Optimization Toolbox to improve the design of an engine cooling fan using Design for Six Sigma Techniques.
Convert survival data to counting process form and then construct a Cox proportional hazards model with time-dependent covariates.
Machine learning techniques are often used for financial analysis and decision-making tasks such as accurate forecasting, classification of risk, estimating probabilities of default,
Clustering is a form of unsupervised learning. The purpose of clustering is to identify natural groupings of data from a large data set to produce a concise representation based on
In the past decade the development of automatic techniques to estimate the intrinsic dimensionality of a given dataset has gained considerable attention due to its relevance in several
Demonstrates fitting a non-linear temperature model to hourly dry bulb temperatures recorded in the New England region. The temperature series is modeled as a sum of two components, a
The Natural Gas Price model, Temperature model and Electricity Price hybrid model are jointly simulated to create market scenarios. Then, given a set of plant parameters and constraints a
When the arrays are too large, computing the entire array may not fit entirely into memory. ipdm is smart enough to break the problem up to accomplish the task anyway. In this example, the
Visualize dissimilarity data using non-classical forms of multidimensional scaling (MDS).
Use Principal Components Analysis (PCA) to fit a linear regression. PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as
Visualize multivariate data using various statistical plots. Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to
Perform "classical" multidimensional scaling, using the cmdscale function in the Statistics and Machine Learning Toolbox™. Classical multidimensional scaling, also known as
Select features for classifying high-dimensional data. More specifically, it shows how to perform sequential feature selection, which is one of the most popular feature selection
Compute and plot the pdf of a Poisson distribution with parameter lambda = 5.
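The calculation can be sketched outside MATLAB as well. The following is a minimal Python illustration, not the toolbox example itself; the function name `poisson_pmf` and the text-bar display are assumptions made for this sketch.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5.0  # parameter lambda from the example description
for k in range(16):
    p = poisson_pmf(k, lam)
    # Crude text "plot": one '#' per 0.005 of probability mass.
    print(f"{k:2d}  {p:.4f}  " + "#" * round(p * 200))
```

For lambda = 5 the probabilities at k = 4 and k = 5 are equal, which is the usual behavior when lambda is an integer.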
Use copulafit to calibrate copulas with data. To generate data Xsim with a distribution "just like" (in terms of marginal distributions and correlations) the distribution of data in the
Similar to the bootstrap is the jackknife, which uses resampling to estimate the bias of a sample statistic. Sometimes it is also used to estimate standard error of the sample statistic. The
Plot the pdf of a bivariate Student's t distribution. You can use this distribution for a higher number of dimensions as well, although visualization is not easy.
Compute and plot the pdf of a negative binomial distribution using four different values for the parameter r, the desired number of successes: 0.1, 1, 3, and 6. In each case, the probability of success p is 0.5.
As for all discrete distributions, the cdf is a step function. The plot shows the discrete uniform cdf for N = 10.
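The step behavior can be illustrated with a short Python sketch, assuming the distribution places equal mass on the integers 1 through N. The function name `discrete_uniform_cdf` is hypothetical.

```python
import math

def discrete_uniform_cdf(x: float, n: int) -> float:
    """CDF of the uniform distribution on the integers 1..n (assumed support)."""
    if x < 1:
        return 0.0
    # The cdf only changes at integers, so it is constant between them.
    return min(math.floor(x), n) / n

n = 10
for x in (0.5, 1, 1.5, 3.7, 10, 12):
    print(f"F({x}) = {discrete_uniform_cdf(x, n):.1f}")
```

The value stays flat between consecutive integers and jumps by 1/N at each one, which is what produces the staircase shape.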
The bootstrap procedure involves choosing random samples with replacement from a data set and analyzing each sample the same way. Sampling with replacement means that each observation is
Compute the pdf of an F distribution with 5 numerator degrees of freedom and 3 denominator degrees of freedom.
Compute the pdf of a gamma distribution with parameters A = 100 and B = 10. For comparison, also compute the pdf of a normal distribution with parameters mu = 1000 and sigma = 100.
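The comparison is natural because, taking A as the shape and B as the scale, the gamma distribution has mean A*B = 1000 and standard deviation sqrt(A)*B = 100. A minimal Python sketch of both densities follows; the function names are assumptions for illustration, and the gamma density is evaluated in log space to avoid overflow.

```python
import math

def gamma_pdf(x: float, a: float, b: float) -> float:
    """Gamma density with shape a and scale b, computed via logarithms."""
    if x <= 0:
        return 0.0
    log_pdf = (a - 1) * math.log(x) - x / b - a * math.log(b) - math.lgamma(a)
    return math.exp(log_pdf)

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Tabulate both densities on a coarse grid around the common mean.
for x in range(700, 1301, 100):
    print(x, f"{gamma_pdf(x, 100, 10):.6f}", f"{normal_pdf(x, 1000, 100):.6f}")
```

With a shape parameter this large, the two columns should agree closely near the mean, with small differences in the tails where the gamma density is skewed.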
Compute the pdf of an exponential distribution with parameter mu = 2.
Suppose the income of a family of four in the United States follows a lognormal distribution with mu = log(20,000) and sigma = 1. Compute and plot the income density.
Compute the pdf for a Student's t distribution with parameter nu = 5, and for a standard normal distribution.
Compute the pdf of a chi-square distribution with 4 degrees of freedom.
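As a rough illustration in Python (not the toolbox code), the chi-square density can be written directly from its standard formula. The function name `chi2_pdf` is an assumption of this sketch.

```python
import math

def chi2_pdf(x: float, k: int) -> float:
    """Chi-square density with k degrees of freedom."""
    if x <= 0:
        return 0.0
    half_k = k / 2
    return x ** (half_k - 1) * math.exp(-x / 2) / (2 ** half_k * math.gamma(half_k))

for x in (0.5, 1, 2, 4, 8):
    print(x, f"{chi2_pdf(x, 4):.5f}")
```

For 4 degrees of freedom the formula reduces to x * exp(-x/2) / 4, which peaks at x = 2.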
Compute and plot the cdf of a hypergeometric distribution.
Several examples show how to use the gkdeb function.
Since the bivariate normal distribution is defined on the plane, you can also compute cumulative probabilities over rectangular regions.
Generate examples of probability density functions for the three basic forms of the generalized extreme value distribution.
Suppose the probability of a five-year-old car battery not starting in cold weather is 0.03. What is the probability of the car starting for 25 consecutive days during a long cold snap?
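One way to work the arithmetic, assuming each day's start is independent of the others, is to raise the daily success probability to the number of days. The Python sketch below uses a hypothetical helper name, `prob_all_start`.

```python
def prob_all_start(p_fail: float, days: int) -> float:
    """Probability of starting on every one of `days` days, assuming independence."""
    p_start = 1.0 - p_fail
    return p_start ** days

p = prob_all_start(0.03, 25)
print(f"P(starts 25 days in a row) = {p:.4f}")
```

Under that independence assumption the result is about 0.467, so a run of 25 consecutive successful starts is slightly less likely than not.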
Compute the pdf of three generalized Pareto distributions. The first has shape parameter k = -0.25, the second has k = 0, and the third has k = 1.
In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric
Test for the significance of the regression coefficients using the t-statistic.
Display R-squared (coefficient of determination) and adjusted R-squared. Load the sample data and define the response and independent variables.
Fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.
Compute the covariance matrix and standard errors of the coefficients.
Use the CovRatio statistics to determine the influential points in data. Load the sample data and define the response and predictor variables.
Identify and remove redundant predictors from a generalized linear model.
Uses a bagged ensemble so it can use all three methods of evaluating ensemble quality.
View a classification or regression tree. There are two ways to view a tree: view(tree) returns a text description and view(tree,'mode','graph') returns a graphic description of the tree.
Uses data for predicting the insurance risk of a car based on its many attributes.
Test for autocorrelation among the residuals of a linear regression model.
Determine the observations that are influential on the fitted response values using Dffits values. Load the sample data and define the response and independent variables.
Fit a generalized linear model and analyze the results. A typical workflow involves the following: import data, fit a generalized linear model, test its quality, modify it to improve the
Determine the observations that have large influence on coefficients using Dfbetas. Load the sample data and define the response and independent variables.
Regularize binomial regression. The default (canonical) link function for binomial regression is the logistic function.
Compute Leverage values and assess high leverage observations. Load the sample data and define the response and independent variables.
Assess the model assumptions by examining the residuals of a fitted linear regression model.
Assess the fit of the model and the significance of the regression coefficients using the F-statistic.
There are diagnostic plots to help you examine the quality of a model. plotDiagnostics(mdl) gives a variety of plots, including leverage and Cook's distance plots. plotResiduals(mdl)
Use the methods predict, feval, and random to predict and simulate responses to new data.
Linear Mixed-Effects (LME) models are generalizations of linear regression models for data that is collected and summarized in groups. Linear Mixed-Effects models offer a flexible
Principal Component Analysis (PCA) and Partial Least Squares (PLS) are widely used tools. This code shows their relationship through the Nonlinear Iterative PArtial Least Squares
In this demo, we will perform statistical analysis on automotive fuel economy data provided by the United States Environmental Protection Agency. We will see how the Statistics Toolbox™
This script demonstrates the use of constrained polynomials to compute a constrained approximation, that is, an approximation that fulfills predefined constraints. Such problems occur
Human activity sensor data contains observations derived from sensor measurements taken from smartphones worn by people while doing different activities (walking, lying, sitting, etc.).
Rare events prediction in complex technical systems has been a very interesting and critical issue for many industrial and commercial fields due to a huge increase of sensors and rapid growth
Construct a map of 10 US cities based on the distances between those cities, using cmdscale.
Demonstration of the dot product and orthogonality; also includes some vector addition. Information from this tutorial is used in QR decomposition and multiple regression approach
The Linstats package provides a uniform mechanism for building any supported linear model. Once built, the same model can be analyzed in many ways including least-squares regression, fit and
This demo showcases visualization and analysis for forecasting energy demand based on historical data. We have access to hour-by-hour utility usage for the year 2006, including
Examples A and B make it clear that if we are trying to view uniform data over the hypercube most (spherical) neighborhoods will be empty! Let us examine what happens if the data follow the
Consider the hypercube and an inscribed hypersphere with radius . Then the fraction of the volume of the cube contained in the hypersphere is given by:
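The formula itself is not reproduced in this listing. As a hedged sketch, the standard result for a d-dimensional ball of radius r inscribed in a cube of side 2r is the ratio pi^(d/2) / (2^d * Gamma(d/2 + 1)), which does not depend on r. The symbol d and the function name below are assumptions of this illustration.

```python
import math

def inscribed_ball_fraction(d: int) -> float:
    """Fraction of a d-dimensional cube's volume occupied by its inscribed ball."""
    # Ball volume: pi^(d/2) * r^d / Gamma(d/2 + 1); cube volume: (2r)^d.
    # The r^d factors cancel, so the ratio depends only on d.
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (1, 2, 3, 5, 10, 20):
    print(f"d = {d:2d}  fraction = {inscribed_ball_fraction(d):.3e}")
```

The fraction is pi/4 in two dimensions and pi/6 in three, and it shrinks rapidly as d grows, which is why most of a high-dimensional cube's volume lies in its corners.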
The dynamic response of a 100 m high clamped-free steel beam is studied. Simulated time series are used, where the first three eigen-modes have been taken into account. More precisely, the
This case study analyzes the amount of vibration a passenger experiences for a vehicle traveling over a road disturbance (bump). We want to determine the amount of reduction in displacement
The time series from an SDOF is computed using the central difference method, and a white noise is used as an input force.
Hypothesis testing based on a model that is invalid can lead to faulty conclusions. This tutorial goes over a few basic diagnostic procedures that can be used to test whether a model is valid.
This is an enhanced version of the regstats function (statistics toolbox). It implements several ways to estimate robust standard errors (se) for the coefficients. Also, it
This tutorial will go over some of the functions available for making inferences and testing hypotheses. I assume that you know how to construct a model using encode. If not see the
This file describes the development of a failure boundary identification algorithm shown in "Using Statistics and Optimization to Support Design Activities" Webinar, July 21, 2009.
Among many statistical anomaly detection techniques, Hotelling’s T-square method, a multivariate statistical analysis technique, has been one of the most typical methods. This method
This tutorial describes multivariate Gaussians as it walks through the major functionality of the mmvn toolkit.
In this script, I reproduce the results presented by John D. Holmes in the first part of chapter 2 of his book, Wind loading of structures. The notations he uses are slightly different in
From: "Using Statistics and Optimization to Support Design Activities" Webinar, July 21, 2009.
In order to illustrate some of the numerical calculations required for testing hypotheses on canonical variance components based on the LRT statistic (37), as presented in Example 3, let us
Though Hotelling’s T-square method is applicable for many multi-dimensional data sets, this method has a fundamental assumption that the data follow a unimodal distribution. So, when the