| Statistics Toolbox™ | ![]() |
| On this page… |
|---|
Cumulative Distribution Functions Inverse Cumulative Distribution Functions Distribution Statistics Functions Distribution Fitting Functions |
For each distribution supported by Statistics Toolbox™ software, a selection of the distribution functions described in this section is available for statistical programming. This section gives a general overview of the use of each type of function, independent of the particular distribution. For specific functions available for specific distributions, see Supported Distributions.
Probability density functions (pdfs) for supported Statistics Toolbox distributions all end with pdf, as in binopdf or exppdf. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.
For discrete distributions, the pdf assigns a probability to each outcome. In this context, the pdf is often called a probability mass function (pmf).
For example, the discrete binomial pdf
![]()
assigns probability to the event of k successes in n trials of a Bernoulli process (such as coin flipping) with probability p of success at each trial. Each of the integers k = 0, 1, 2, ..., n is assigned a positive probability, with the sum of the probabilities equal to 1. The probabilities are computed with the binopdf function:
p = 0.2; % Probability of success for each trial n = 10; % Number of trials k = 0:n; % Outcomes m = binopdf(k,n,p); % Probability mass vector bar(k,m) % Visualize the probability distribution set(get(gca,'Children'),'FaceColor',[.8 .8 1]) grid on

For continuous distributions, the pdf assigns a probability density to each outcome. The probability of any single outcome is zero. The pdf must be integrated over a set of outcomes to compute the probability that an outcome falls within that set. The integral over the entire set of outcomes is 1.
For example, the continuous exponential pdf
![]()
is used to model the probability that a process with constant failure rate λ will have a failure within time t . Each time t > 0 is assigned a positive probability density. Densities are computed with the exppdf function:
lambda = 2; % Failure rate t = 0:0.01:3; % Outcomes f = exppdf(t,1/lambda); % Probability density vector plot(t,f) % Visualize the probability distribution grid on

Probabilities for continuous pdfs can be computed with the quad function. In the example above, the probability of failure in the time interval [0, 1] is computed as follows:
f_lambda = @(t)exppdf(t,1/lambda); % Pdf with fixed lambda
P = quad(f_lambda,0,1) % Integrate from 0 to 1
P =
0.8647Alternatively, the cumulative distribution function (cdf) for the exponential function, expcdf, can be used:
P = expcdf(1,1/lambda) % Cumulative probability from 0 to 1
P =
0.8647Introduction. A distribution of data can be described graphically with a histogram:
cars = load('carsmall','MPG','Origin');
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])

You can also describe a data distribution by estimating its density. The ksdensity function does this using a kernel smoothing method. A nonparametric density estimate of the data above, using the default kernel and bandwidth, is given by:
[f,x] = ksdensity(MPG);
plot(x,f);
title('Density estimate for MPG')

Kernel Bandwidth. The choice of kernel bandwidth controls the smoothness of the probability density curve. The following graph shows the density estimate for the same mileage data using different bandwidths. The default bandwidth is in blue and looks like the preceding graph. Estimates for smaller and larger bandwidths are in red and green.
The first call to ksdensity returns the default bandwidth, u, of the kernel smoothing function. Subsequent calls modify this bandwidth.
[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on
[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');
[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');
legend('default width','1/3 default','3*default')
hold off

The default bandwidth seems to be doing a good job—reasonably smooth, but not so smooth as to obscure features of the data. This bandwidth is the one that is theoretically optimal for estimating densities for the normal distribution.
The green curve shows a density with the kernel bandwidth set too high. This curve smooths out the data so much that the end result looks just like the kernel function. The red curve has a smaller bandwidth and is rougher looking than the blue curve. It may be too rough, but it does provide an indication that there might be two major peaks rather than the single peak of the blue curve. A reasonable choice of width might lead to a curve that is intermediate between the red and blue curves.
Kernel Smoothing. You can also specify a kernel function by supplying either the function name or a function handle. The four preselected functions, 'normal', 'epanechnikov', 'box', and 'triangle', are all scaled to have standard deviation equal to 1, so they perform a comparable degree of smoothing.
Using default bandwidths, you can now plot the same mileage data, using each of the available kernel functions.
hname = {'normal' 'epanechnikov' 'box' 'triangle'};
hold on;
colors = {'r' 'b' 'g' 'm'};
for j=1:4
[f,x] = ksdensity(MPG,'kernel',hname{j});
plot(x,f,colors{j});
end
legend(hname{:});
hold off

The density estimates are roughly comparable, but the box kernel produces a density that is rougher than the others.
Using Density Estimates. While it is difficult to overlay two histograms to compare them, you can easily overlay smooth density estimates. For example, the following graph shows the MPG distributions for cars from different countries of origin:
Origin = cellstr(cars.Origin);
I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);
[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on
[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')
[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')
legend('USA','Japan','Europe')
hold off

For piecewise probability density estimation, using kernel smoothing in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.
Cumulative distribution functions (cdfs) for supported Statistics Toolbox distributions all end with cdf, as in binocdf or expcdf. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.
For discrete distributions, the cdf F is related to the pdf f by
![]()
For continuous distributions, the cdf F is related to the pdf f by
![]()
Cdfs are used to compute probabilities of events. In particular, if F is a cdf and x and y are outcomes, then
P(y ≤ x) = F(x)
P(y ≥ x) = 1 – F(x)
P(x1 ≤ y ≤ x2) = F(x2) – F(x1)
For example, the t-statistic
![]()
follows a Student's t distribution with n – 1 degrees of freedom when computed
from repeated random samples from a normal population with mean μ. Here
is
the sample mean, s is the sample standard deviation,
and n is the sample size. The probability of
observing a t-statistic greater than or equal
to the value computed from a sample can be found with the tcdf function:
mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
0.4020This probability is the same as the p-value returned by a t-test of the null hypothesis that the sample comes from a normal population with mean μ:
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020The ksdensity function produces an empirical version of a probability density function (pdf). That is, instead of selecting a density with a particular parametric form and estimating the parameters, it produces a nonparametric density estimate that adapts itself to the data.
Similarly, it is possible to produce an empirical version of
the cumulative distribution function (cdf). The ecdf function computes this empirical cdf. It returns
the values of a function
such that
represents the proportion of observations in
a sample less than or equal to
.
The idea behind the empirical cdf is simple. It is a function
that assigns probability
to each of
observations
in a sample. Its graph has a stair-step appearance. If a sample comes
from a distribution in a parametric family (such as a normal distribution),
its empirical cdf is likely to resemble the parametric distribution.
If not, its empirical distribution still gives an estimate of the
cdf for the distribution that generated the data.
The following example generates 20 observations from a normal distribution with mean 10 and standard deviation 2. You can use ecdf to calculate the empirical cdf and stairs to plot it. Then you overlay the normal distribution curve on the empirical function.
x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);
stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)

The empirical cdf is especially useful in survival analysis applications. In such applications the data may be censored, that is, not observed exactly. Some individuals may fail during a study, and you can observe their failure time exactly. Other individuals may drop out of the study, or may not fail until after the study is complete. The ecdf function has arguments for dealing with censored data. In addition, you can use the coxphfit function with individuals that have predictors that are not the same.
For piecewise probability density estimation, using the empirical cdf in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.
Inverse cumulative distribution functions for supported Statistics Toolbox distributions all end with inv, as in binoinv or expinv. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of cumulative probabilities between 0 and 1 followed by a list of parameter values specifying a particular member of the distribution family.
For continuous distributions, the inverse cdf returns the unique outcome whose cdf value is the input cumulative probability.
For example, the expinv function can be used to compute inverses of exponential cumulative probabilities:
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = expcdf(x,1) % Cumulative probabilities
p =
0.3935 0.5034 0.5934 0.6671 0.7275 0.7769
expinv(p,1) % Return original outcomes
ans =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000For discrete distributions, there may be no outcome whose cdf value is the input cumulative probability. In these cases, the inverse cdf returns the first outcome whose cdf value equals or exceeds the input cumulative probability.
For example, the binoinv function can be used to compute inverses of binomial cumulative probabilities:
x = 0.5:0.2:1.5 % Outcomes
x =
0.5000 0.7000 0.9000 1.1000 1.3000 1.5000
p = binocdf(x,10,0.2) % Cumulative probabilities
p =
0.1074 0.1074 0.1074 0.3758 0.3758 0.3758
>> binoinv(p,10,0.2) % Return binomial outcomes
ans =
0 0 0 1 1 1The inverse cdf is useful in hypothesis testing, where critical outcomes of a test statistic are computed from cumulative significance probabilities. For example, norminv can be used to compute a 95% confidence interval under the assumption of normal variability:
p = [0.025 0.975]; % Interval containing 95% of [0,1]
x = norminv(p,0,1) % Assume standard normal variability
x =
-1.9600 1.9600 % 95% confidence interval
n = 20; % Sample size
y = normrnd(8,1,n,1); % Random sample (assume mean is unknown)
ybar = mean(y);
ci = ybar + (1/sqrt(n))*x % Confidence interval for mean
ci =
7.6779 8.5544 Distribution statistics functions for supported Statistics Toolbox distributions all end with stat, as in binostat or expstat. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family. Functions return the mean and variance of the distribution, as a function of the parameters.
For example, the wblstat function can be used to visualize the mean of the Weibull distribution as a function of its two distribution parameters:
a = 0.5:0.1:3; b = 0.5:0.1:3; [A,B] = meshgrid(a,b); M = wblstat(A,B); surfc(A,B,M)

Distribution fitting functions for supported Statistics Toolbox distributions all end with fit, as in binofit or expfit. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are arrays of data, presumed to be samples from some member of the selected distribution family. Functions return maximum likelihood estimates (MLEs) of distribution parameters, that is, parameters for the distribution family member with the maximum likelihood of producing the data as a random sample.
The Statistics Toolbox function mle is a convenient front end to the individual distribution fitting functions, and more. The function computes MLEs for distributions beyond those for which Statistics Toolbox software provides specific pdf functions.
For some pdfs, MLEs can be given in closed form and computed directly. For other pdfs, a search for the maximum likelihood must be employed. The search can be controlled with an options input argument, created using the statset function. For efficient searches, it is important to choose a reasonable distribution model and set appropriate convergence tolerances.
MLEs can be heavily biased, especially for small samples. As sample size increases, however, MLEs become unbiased minimum variance estimators with approximate normal distributions. This is used to compute confidence bounds for the estimates.
For example, consider the following distribution of means from repeated random samples of an exponential distribution:
mu = 1; % Population parameter n = 1e3; % Sample size ns = 1e4; % Number of samples samples = exprnd(mu,n,ns); % Population samples means = mean(samples); % Sample means
The Central Limit Theorem says that the means will be approximately normally distributed, regardless of the distribution of the data in the samples. The normfit function can be used to find the normal distribution that best fits the means:
[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
1.0003
sigmahat =
0.0319
muci =
0.9997
1.0010
sigmaci =
0.0314
0.0323The function returns MLEs for the mean and standard deviation and their 95% confidence intervals.
To visualize the distribution of sample means together with the fitted normal distribution, you must scale the fitted pdf, with area = 1, to the area of the histogram being used to display the means:
numbins = 50; hist(means,numbins) set(get(gca,'Children'),'FaceColor',[.8 .8 1]) hold on [bincounts,binpositions] = hist(means,numbins); binwidth = binpositions(2) - binpositions(1); histarea = binwidth*sum(bincounts); x = binpositions(1):0.001:binpositions(end); y = normpdf(x,muhat,sigmahat); plot(x,histarea*y,'r','LineWidth',2)

The parametric methods discussed in Fitting Supported Distributions fit data samples with smooth distributions that have a relatively low-dimensional set of parameters controlling their shape. These methods work well in many cases, but there is no guarantee that a given sample will be described accurately by any of the supported Statistics Toolbox distributions.
The empirical distributions computed by ecdf and discussed in Nonparametric Estimation assign equal probability to each observation in a sample, providing an exact match of the sample distribution. However, the distributions are not smooth, especially in the tails where data may be sparse.
The paretotails function fits a distribution by piecing together the empirical distribution in the center of the sample with smooth generalized Pareto distributions (GPDs) in the tails. The output is an object of the @paretotails class, with associated methods to evaluate the cdf, inverse cdf, and other functions of the fitted distribution.
As an example, consider the following data, with about 20% outliers:
left_tail = -exprnd(1,10,1); right_tail = exprnd(5,10,1); center = randn(80,1); data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = mle(data,'dist','tlo');
t = @(data,mu,sig,df)cdf('tlocationscale',data,mu,sig,df);
h = probplot(gca,t,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Data','Normal','t','Location','NW')

On the other hand, the empirical distribution provides a perfect fit, but the outliers make the tails very discrete:
ecdf(data)

Random samples generated from this distribution by inversion might include, for example, values around 4.33 and 9.25, but nothing in-between.
The paretotails function provides a single, well-fit model for the entire sample. The following uses generalized Pareto distributions (GPDs) for the lower and upper 10% of the data:
pfit = paretotails(data,0.1,0.9)
pfit =
Piecewise distribution with 3 segments
-Inf < x < -1.30726 (0 < p < 0.1)
lower tail, GPD(-1.10167,1.12395)
-1.30726 < x < 1.27213 (0.1 < p < 0.9)
interpolated empirical cdf
1.27213 < x < Inf (0.9 < p < 1)
upper tail, GPD(1.03844,0.726038)
x = -4:0.01:10;
plot(x,cdf(pfit,x))

Access information about the fit using the methods of the @paretotails class. Options allow for nonparametric estimation of the center of the cdf.
Negative log-likelihood functions for supported Statistics Toolbox distributions all end with like, as in explike. Specific function names for specific distributions can be found in Supported Distributions.
Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family followed by an array of data. Functions return the negative log-likelihood of the parameters, given the data.
Negative log-likelihood functions are used as objective functions in search algorithms such as the one implemented by the MATLAB® function fminsearch. Additional search algorithms are implemented by Optimization Toolbox™ functions and Genetic Algorithm and Direct Search Toolbox™ functions.
When used to compute maximum likelihood estimates (MLEs), negative log-likelihood functions allow you to choose a search algorithm and exercise low-level control over algorithm execution. By contrast, the functions discussed in Distribution Fitting Functions use preset algorithms with options limited to those set by the statset function.
Likelihoods are conditional probability densities. A parametric family of distributions is specified by its pdf f (x,a), where x and a represent the variables and parameters, respectively. When a is fixed, the pdf is used to compute the density at x, f (x|a). When x is fixed, the pdf is used to compute the likelihood of the parameters a, f (a|x). The joint likelihood of the parameters over an independent random sample X is
![]()
Given X, MLEs maximize L(a) over all possible a.
In numerical algorithms, the log-likelihood function, log(L(a)), is (equivalently) optimized. The logarithm transforms the product of potentially small likelihoods into a sum of logs, which is easier to distinguish from 0 in computation. For convenience, Statistics Toolbox negative log-likelihood functions return the negative of this sum, since the optimization algorithms to which the values are passed typically search for minima rather than maxima.
For example, use gamrnd to generate a random sample from a specific gamma distribution:
a = [1,2]; X = gamrnd(a(1),a(2),1e3,1);
Given X, the gamlike function can be used to visualize the likelihood surface in the neighborhood of a:
mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
for j = 1:mesh
logL(i,j) = gamlike([a1(i),a2(j)],X);
end
end
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)

The MATLAB function fminsearch is used to search for the minimum of the likelihood surface:
LL = @(u)gamlike([u(1),u(2)],X); % Likelihood given X
MLES = fminsearch(LL,[1,2])
MLES =
1.0231 1.9729These can be compared to the MLEs returned by the gamfit function, which uses a combination search and solve algorithm:
ahat = gamfit(X)
ahat =
1.0231 1.9728The MLEs can be added to the surface plot (rotated to show the minimum):
hold on
plot3(MLES(1),MLES(2),LL(MLES),...
'ro','MarkerSize',5,...
'MarkerFaceColor','r')

Random number generators (RNGs) for supported Statistics Toolbox distributions all end with rnd, as in binornd or exprnd. Specific RNG names for specific distributions can be found in Supported Distributions.
Each RNG represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family followed by the dimensions of an array. RNGs return random numbers from the specified distribution in an array of the specified dimensions.
RNGs in Statistics Toolbox software depend on the MATLAB base generators rand and/or randn, using the techniques discussed in Common Generation Methods to generate random numbers from particular distributions. Dependencies of specific RNGs are listed in the table below.
MATLAB resets the state of the base RNGs each time it is started. Thus, by default, dependent RNGs in Statistics Toolbox software will generate the same sequence of values with each MATLAB session. To change this behavior, the state of the base RNGs must be set explicitly in startup.m or in the relevant program code. States can be set to fixed values, for reproducibility, or to time-dependent values, to assure new random sequences. For details on setting the state and the method used by the base RNGs, see rand and randn.
For example, to simulate flips of a biased coin:
p = 0.55; % Probability of heads
n = 100; % Number of flips per trial
N = 1e3; % Number of trials
rand('state',sum(100*clock)) % Set base generator
sims = unifrnd(0,1,n,N) < p; % 1 for heads; 0 for tails
The empirical probability of heads for each trial is given by:
phat = sum(sims)/n;
The probability of heads for each trial can also be simulated by:
prand = binornd(n,p,1,N)/n;
You can compare the two simulations with a histogram:
hist([phat' prand'])
h = get(gca,'Children');
set(h(1),'FaceColor',[.8 .8 1])
legend('UNIFRND','BINORND')

The following table lists the dependencies of Statistics Toolbox RNGs on the MATLAB base RNGs rand and/or randn. Set the states and methods of the RNGs in the right-hand column to assure reproducibility/variability of the outputs of the RNGs in the left-hand column.
| RNG | MATLAB Base RNG |
|---|---|
rand, randn | |
rand | |
rand, randn | |
rand | |
rand | |
rand, randn | |
rand, randn | |
rand | |
rand | |
rand | |
rand | |
rand, randn | |
randn | |
rand | |
rand | |
randn | |
rand or randn, depending on the RNG given for the proposal distribution | |
randn | |
rand, randn | |
rand, randn | |
rand, randn | |
rand, randn | |
randn | |
randn | |
rand or randn, depending on the distribution type | |
rand, randn | |
rand or randn, depending on the specified distribution | |
rand | |
randn | |
rand | |
rand, randn | |
rand | |
rand | |
rand | |
rand, randn |
![]() | Supported Distributions | Distribution GUIs | ![]() |
| © 1984-2008- The MathWorks, Inc. - Site Help - Patents - Trademarks - Privacy Policy - Preventing Piracy - RSS |