| Products & Services | Industries | Academia | Support | User Community | Company |
| Download Product Updates | | | Get Pricing | | | Trial Software |
| Documentation → Statistics Toolbox |
| Contents | Index |
| Learn more about Statistics Toolbox |
| On this page… |
|---|
Cumulative Distribution Functions Inverse Cumulative Distribution Functions Distribution Statistics Functions Distribution Fitting Functions |
For each distribution supported by Statistics Toolbox software, a selection of the distribution functions described in this section is available for statistical programming. This section gives a general overview of the use of each type of function, independent of the particular distribution. For specific functions available for specific distributions, see Supported Distributions.
Probability density functions (pdfs) for supported Statistics Toolbox distributions all end with pdf, as in binopdf or exppdf. For more information on specific function names for specific distributions see Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.
For discrete distributions, the pdf assigns a probability to each outcome. In this context, the pdf is often called a probability mass function (pmf).
For example, the discrete binomial pdf
![]()
assigns probability to the event of k successes in n trials of a Bernoulli process (such as coin flipping) with probability p of success at each trial. Each of the integers k = 0, 1, 2, ..., n is assigned a positive probability, with the sum of the probabilities equal to 1. Compute the probabilities with the binopdf function:
p = 0.2; % Probability of success for each trial n = 10; % Number of trials k = 0:n; % Outcomes m = binopdf(k,n,p); % Probability mass vector bar(k,m) % Visualize the probability distribution set(get(gca,'Children'),'FaceColor',[.8 .8 1]) grid on

For continuous distributions, the pdf assigns a probability density to each outcome. The probability of any single outcome is zero. The pdf must be integrated over a set of outcomes to compute the probability that an outcome falls within that set. The integral over the entire set of outcomes is 1.
For example, the continuous exponential pdf
![]()
is used to model the probability that a process with constant failure rate λ will have a failure within time t . Each time t > 0 is assigned a positive probability density. Densities are computed with the exppdf function:
lambda = 2; % Failure rate t = 0:0.01:3; % Outcomes f = exppdf(t,1/lambda); % Probability density vector plot(t,f) % Visualize the probability distribution grid on

Probabilities for continuous pdfs can be computed with the quad function. In the example above, the probability of failure in the time interval [0, 1] is computed as follows:
f_lambda = @(t)exppdf(t,1/lambda); % Pdf with fixed lambda
P = quad(f_lambda,0,1) % Integrate from 0 to 1
P =
0.8647Alternatively, the cumulative distribution function (cdf) for the exponential function, expcdf, can be used:
P = expcdf(1,1/lambda) % Cumulative probability from 0 to 1
P =
0.8647A distribution of data can be described graphically with a histogram:
cars = load('carsmall','MPG','Origin');
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])

You can also describe a data distribution by estimating its density. The ksdensity function does this using a kernel smoothing method. A nonparametric density estimate of the previous data, using the default kernel and bandwidth, is given by:
[f,x] = ksdensity(MPG);
plot(x,f);
title('Density estimate for MPG')

Controlling Probability Density Curve Smoothness. The choice of kernel bandwidth controls the smoothness of the probability density curve. The following graph shows the density estimate for the same mileage data using different bandwidths. The default bandwidth is in blue and looks like the preceding graph. Estimates for smaller and larger bandwidths are in red and green.
The first call to ksdensity returns the default bandwidth, u, of the kernel smoothing function. Subsequent calls modify this bandwidth.
[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on
[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');
[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');
legend('default width','1/3 default','3*default')
hold off

The default bandwidth seems to be doing a good job—reasonably smooth, but not so smooth as to obscure features of the data. This bandwidth is the one that is theoretically optimal for estimating densities for the normal distribution.
The green curve shows a density with the kernel bandwidth set too high. This curve smooths out the data so much that the end result looks just like the kernel function. The red curve has a smaller bandwidth and is rougher looking than the blue curve. It may be too rough, but it does provide an indication that there might be two major peaks rather than the single peak of the blue curve. A reasonable choice of width might lead to a curve that is intermediate between the red and blue curves.
Specifying Kernel Smoothing Functions. You can also specify a kernel function by supplying either the function name or a function handle. The four preselected functions, 'normal', 'epanechnikov', 'box', and 'triangle', are all scaled to have standard deviation equal to 1, so they perform a comparable degree of smoothing.
Using default bandwidths, you can now plot the same mileage data, using each of the available kernel functions.
hname = {'normal' 'epanechnikov' 'box' 'triangle'};
hold on;
colors = {'r' 'b' 'g' 'm'};
for j=1:4
[f,x] = ksdensity(MPG,'kernel',hname{j});
plot(x,f,colors{j});
end
legend(hname{:});
hold off

The density estimates are roughly comparable, but the box kernel produces a density that is rougher than the others.
Comparing Density Estimates. While it is difficult to overlay two histograms to compare them, you can easily overlay smooth density estimates. For example, the following graph shows the MPG distributions for cars from different countries of origin:
Origin = cellstr(cars.Origin);
I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);
[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on
[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')
[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')
legend('USA','Japan','Europe')
hold off

For piecewise probability density estimation, using kernel smoothing in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.
Cumulative distribution functions (cdfs) for supported Statistics Toolbox distributions all end with cdf, as in binocdf or expcdf. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.
For discrete distributions, the cdf F is related to the pdf f by
![]()
For continuous distributions, the cdf F is related to the pdf f by
![]()
Cdfs are used to compute probabilities of events. In particular, if F is a cdf and x and y are outcomes, then
P(y ≤ x) = F(x)
P(y ≥ x) = 1 – F(x)
P(x1 ≤ y ≤ x2) = F(x2) – F(x1)
For example, the t-statistic
![]()
follows a Student's t distribution with n –
1 degrees of freedom when computed from repeated random samples from
a normal population with mean μ. Here
is the sample mean, s is
the sample standard deviation, and n is the sample
size. The probability of observing a t-statistic
greater than or equal to the value computed from a sample can be found
with the tcdf function:
mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
0.4020This probability is the same as the p-value returned by a t-test of the null hypothesis that the sample comes from a normal population with mean μ:
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020The ksdensity function produces an empirical version of a probability density function (pdf). That is, instead of selecting a density with a particular parametric form and estimating the parameters, it produces a nonparametric density estimate that adapts itself to the data.
Similarly, it is possible to produce an empirical version of
the cumulative distribution function (cdf). The ecdf function
computes this empirical cdf. It returns the values of a function
such that
represents
the proportion of observations in a sample less than or equal to
.
The idea behind the empirical cdf is simple. It is a function
that assigns probability
to each of
observations in a sample. Its graph
has a stair-step appearance. If a sample comes from a distribution
in a parametric family (such as a normal distribution), its empirical
cdf is likely to resemble the parametric distribution. If not, its
empirical distribution still gives an estimate of the cdf for the
distribution that generated the data.
The following example generates 20 observations from a normal distribution with mean 10 and standard deviation 2. You can use ecdf to calculate the empirical cdf and stairs to plot it. Then you overlay the normal distribution curve on the empirical function.
x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);
stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)

The empirical cdf is especially useful in survival analysis applications. In such applications the data may be censored, that is, not observed exactly. Some individuals may fail during a study, and you can observe their failure time exactly. Other individuals may drop out of the study, or may not fail until after the study is complete. The ecdf function has arguments for dealing with censored data. In addition, you can use the coxphfit function with individuals that have predictors that are not the same.
For piecewise probability density estimation, using the empirical cdf in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.
Inverse cumulative distribution functions for supported Statistics Toolbox distributions all end with inv, as in binoinv or expinv. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of cumulative probabilities between 0 and 1 followed by a list of parameter values specifying a particular member of the distribution family.
For continuous distributions, the inverse cdf returns the unique outcome whose cdf value is the input cumulative probability.
For example, the expinv function can be used to compute inverses of exponential cumulative probabilities:
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = expcdf(x,1) % Cumulative probabilities
p =
0.3935 0.5034 0.5934 0.6671 0.7275 0.7769
expinv(p,1) % Return original outcomes
ans =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000For discrete distributions, there may be no outcome whose cdf value is the input cumulative probability. In these cases, the inverse cdf returns the first outcome whose cdf value equals or exceeds the input cumulative probability.
For example, the binoinv function can be used to compute inverses of binomial cumulative probabilities:
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = binocdf(x,10,0.2) % Cumulative probabilities
p =
0.1074 0.1074 0.1074 0.3758 0.3758 0.3758
>> binoinv(p,10,0.2) % Return binomial outcomes
ans =
0 0 0 1 1 1The inverse cdf is useful in hypothesis testing, where critical outcomes of a test statistic are computed from cumulative significance probabilities. For example, norminv can be used to compute a 95% confidence interval under the assumption of normal variability:
p = [0.025 0.975]; % Interval containing 95% of [0,1]
x = norminv(p,0,1) % Assume standard normal variability
x =
-1.9600 1.9600 % 95% confidence interval
n = 20; % Sample size
y = normrnd(8,1,n,1); % Random sample (assume mean is unknown)
ybar = mean(y);
ci = ybar + (1/sqrt(n))*x % Confidence interval for mean
ci =
7.6779 8.5544 Distribution statistics functions for supported Statistics Toolbox distributions all end with stat, as in binostat or expstat. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family. Functions return the mean and variance of the distribution, as a function of the parameters.
For example, the wblstat function can be used to visualize the mean of the Weibull distribution as a function of its two distribution parameters:
a = 0.5:0.1:3; b = 0.5:0.1:3; [A,B] = meshgrid(a,b); M = wblstat(A,B); surfc(A,B,M)

Distribution fitting functions for supported Statistics Toolbox distributions all end with fit, as in binofit or expfit. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of data, presumed to be samples from some member of the selected distribution family. Functions return maximum likelihood estimates (MLEs) of distribution parameters, that is, parameters for the distribution family member with the maximum likelihood of producing the data as a random sample.
The Statistics Toolbox function mle is a convenient front end to the individual distribution fitting functions, and more. The function computes MLEs for distributions beyond those for which Statistics Toolbox software provides specific pdf functions.
For some pdfs, MLEs can be given in closed form and computed directly. For other pdfs, a search for the maximum likelihood must be employed. The search can be controlled with an options input argument, created using the statset function. For efficient searches, it is important to choose a reasonable distribution model and set appropriate convergence tolerances.
MLEs can be heavily biased, especially for small samples. As sample size increases, however, MLEs become unbiased minimum variance estimators with approximate normal distributions. This is used to compute confidence bounds for the estimates.
For example, consider the following distribution of means from repeated random samples of an exponential distribution:
mu = 1; % Population parameter n = 1e3; % Sample size ns = 1e4; % Number of samples samples = exprnd(mu,n,ns); % Population samples means = mean(samples); % Sample means
The Central Limit Theorem says that the means will be approximately normally distributed, regardless of the distribution of the data in the samples. The normfit function can be used to find the normal distribution that best fits the means:
[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
1.0003
sigmahat =
0.0319
muci =
0.9997
1.0010
sigmaci =
0.0314
0.0323The function returns MLEs for the mean and standard deviation and their 95% confidence intervals.
To visualize the distribution of sample means together with the fitted normal distribution, you must scale the fitted pdf, with area = 1, to the area of the histogram being used to display the means:
numbins = 50; hist(means,numbins) set(get(gca,'Children'),'FaceColor',[.8 .8 1]) hold on [bincounts,binpositions] = hist(means,numbins); binwidth = binpositions(2) - binpositions(1); histarea = binwidth*sum(bincounts); x = binpositions(1):0.001:binpositions(end); y = normpdf(x,muhat,sigmahat); plot(x,histarea*y,'r','LineWidth',2)

The parametric methods discussed in Fitting Regular Distributions fit data samples with smooth distributions that have a relatively low-dimensional set of parameters controlling their shape. These methods work well in many cases, but there is no guarantee that a given sample will be described accurately by any of the supported Statistics Toolbox distributions.
The empirical distributions computed by ecdf and discussed in Estimating Empirical CDFs assign equal probability to each observation in a sample, providing an exact match of the sample distribution. However, the distributions are not smooth, especially in the tails where data may be sparse.
The paretotails function fits a distribution by piecing together the empirical distribution in the center of the sample with smooth generalized Pareto distributions (GPDs) in the tails. The output is an object of the paretotails class, with associated methods to evaluate the cdf, inverse cdf, and other functions of the fitted distribution.
As an example, consider the following data, with about 20% outliers:
left_tail = -exprnd(1,10,1); right_tail = exprnd(5,10,1); center = randn(80,1); data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = fitdist(data,'tlocationscale');
h = probplot(gca,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Normal','Data','t','Location','NW')

On the other hand, the empirical distribution provides a perfect fit, but the outliers make the tails very discrete:
ecdf(data)

Random samples generated from this distribution by inversion might include, for example, values around 4.33 and 9.25, but nothing in-between.
The paretotails function provides a single, well-fit model for the entire sample. The following uses generalized Pareto distributions (GPDs) for the lower and upper 10% of the data:
pfit = paretotails(data,0.1,0.9)
pfit =
Piecewise distribution with 3 segments
-Inf < x < -1.30726 (0 < p < 0.1)
lower tail, GPD(-1.10167,1.12395)
-1.30726 < x < 1.27213 (0.1 < p < 0.9)
interpolated empirical cdf
1.27213 < x < Inf (0.9 < p < 1)
upper tail, GPD(1.03844,0.726038)
x = -4:0.01:10;
plot(x,cdf(pfit,x))

Access information about the fit using the methods of the paretotails class. Options allow for nonparametric estimation of the center of the cdf.
Negative log-likelihood functions for supported Statistics Toolbox distributions all end with like, as in explike. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family followed by an array of data. Functions return the negative log-likelihood of the parameters, given the data.
Negative log-likelihood functions are used as objective functions in search algorithms such as the one implemented by the MATLAB function fminsearch. Additional search algorithms are implemented by Optimization Toolbox™ functions and Genetic Algorithm and Direct Search Toolbox™ functions.
When used to compute maximum likelihood estimates (MLEs), negative log-likelihood functions allow you to choose a search algorithm and exercise low-level control over algorithm execution. By contrast, the functions discussed in Distribution Fitting Functions use preset algorithms with options limited to those set by the statset function.
Likelihoods are conditional probability densities. A parametric family of distributions is specified by its pdf f (x,a), where x and a represent the variables and parameters, respectively. When a is fixed, the pdf is used to compute the density at x, f (x|a). When x is fixed, the pdf is used to compute the likelihood of the parameters a, f (a|x). The joint likelihood of the parameters over an independent random sample X is
![]()
Given X, MLEs maximize L(a) over all possible a.
In numerical algorithms, the log-likelihood function, log(L(a)), is (equivalently) optimized. The logarithm transforms the product of potentially small likelihoods into a sum of logs, which is easier to distinguish from 0 in computation. For convenience, Statistics Toolbox negative log-likelihood functions return the negative of this sum, since the optimization algorithms to which the values are passed typically search for minima rather than maxima.
For example, use gamrnd to generate a random sample from a specific gamma distribution:
a = [1,2]; X = gamrnd(a(1),a(2),1e3,1);
Given X, the gamlike function can be used to visualize the likelihood surface in the neighborhood of a:
mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
for j = 1:mesh
logL(i,j) = gamlike([a1(i),a2(j)],X);
end
end
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)

The MATLAB function fminsearch is used to search for the minimum of the likelihood surface:
LL = @(u)gamlike([u(1),u(2)],X); % Likelihood given X
MLES = fminsearch(LL,[1,2])
MLES =
1.0231 1.9729These can be compared to the MLEs returned by the gamfit function, which uses a combination search and solve algorithm:
ahat = gamfit(X)
ahat =
1.0231 1.9728The MLEs can be added to the surface plot (rotated to show the minimum):
hold on
plot3(MLES(1),MLES(2),LL(MLES),...
'ro','MarkerSize',5,...
'MarkerFaceColor','r')

The Statistics Toolbox supports the generation of random numbers from various distributions. Each RNG represents a parametric family of distributions. RNGs return random numbers from the specified distribution in an array of the specified dimensions. Specific RNG names for specific distributions are in Supported Distributions.
Other random number generation functions which do not support specific distributions include:
RNGs in Statistics Toolbox software depend on MATLAB's default random number stream via the rand and randn functions, each RNG uses one of the techniques discussed in Common Generation Methods to generate random numbers from a given distribution.
By controlling the default random number stream and its state, you can control how the RNGs in Statistics Toolbox software generate random values. For example, to reproduce the same sequence of values from an RNG, you can save and restore the default stream's state, or reset the default stream. For details on managing the default random number stream, see Managing the Default Stream.
MATLAB initializes the default random number stream to the same state each time it starts up. Thus, RNGs in Statistics Toolbox software will generate the same sequence of values for each MATLAB session unless you modify that state at startup. One simple way to do that is to add commands to startup.m such as
stream = RandStream('mt19937ar','seed',sum(100*clock));
RandStream.setDefaultStream(stream);
that initialize MATLAB's default random number stream to a different state for each session.
The following table lists the dependencies of Statistics Toolbox RNGs on the MATLAB base RNGs rand and/or randn.
| RNG | MATLAB Base RNG |
|---|---|
rand, randn | |
rand | |
rand, randn | |
rand | |
rand | |
rand, randn | |
rand, randn | |
rand | |
rand | |
rand | |
rand | |
rand, randn | |
randn | |
rand | |
rand | |
randn | |
rand or randn, depending on the RNG given for the proposal distribution | |
randn | |
rand, randn | |
rand, randn | |
rand, randn | |
rand, randn | |
randn | |
randn | |
rand or randn, depending on the distribution type | |
rand, randn | |
rand or randn, depending on the specified distribution | |
rand | |
randn | |
rand | |
rand, randn | |
rand | |
rand | |
rand | |
rand, randn |
![]() | Working with Distributions Through GUIs | Using Probability Distribution Objects | ![]() |

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.
| © 1984-2009- The MathWorks, Inc. - Site Help - Patents - Trademarks - Privacy Policy - Preventing Piracy - RSS |