Products & Services Solutions Academia Support User Community Company

Learn more about Statistics Toolbox   

Statistics Toolbox Distribution Functions

For each distribution supported by Statistics Toolbox software, a selection of the distribution functions described in this section is available for statistical programming. This section gives a general overview of the use of each type of function, independent of the particular distribution. For specific functions available for specific distributions, see Supported Distributions.

Probability Density Functions

Estimating PDFs with Parameters

Probability density functions (pdfs) for supported Statistics Toolbox distributions all end with pdf, as in binopdf or exppdf. For more information on specific function names for specific distributions see Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.

For discrete distributions, the pdf assigns a probability to each outcome. In this context, the pdf is often called a probability mass function (pmf).

For example, the discrete binomial pdf

assigns probability to the event of k successes in n trials of a Bernoulli process (such as coin flipping) with probability p of success at each trial. Each of the integers k = 0, 1, 2, ..., n is assigned a positive probability, with the sum of the probabilities equal to 1. Compute the probabilities with the binopdf function:

p = 0.2; % Probability of success for each trial
n = 10; % Number of trials
k = 0:n; % Outcomes
m = binopdf(k,n,p); % Probability mass vector
bar(k,m) % Visualize the probability distribution
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
grid on

For continuous distributions, the pdf assigns a probability density to each outcome. The probability of any single outcome is zero. The pdf must be integrated over a set of outcomes to compute the probability that an outcome falls within that set. The integral over the entire set of outcomes is 1.

For example, the continuous exponential pdf

is used to model the probability that a process with constant failure rate λ will have a failure within time t . Each time t > 0 is assigned a positive probability density. Densities are computed with the exppdf function:

lambda = 2; % Failure rate
t = 0:0.01:3; % Outcomes
f = exppdf(t,1/lambda); % Probability density vector
plot(t,f) % Visualize the probability distribution
grid on

Probabilities for continuous pdfs can be computed with the quad function. In the example above, the probability of failure in the time interval [0, 1] is computed as follows:

f_lambda = @(t)exppdf(t,1/lambda); % Pdf with fixed lambda
P = quad(f_lambda,0,1) % Integrate from 0 to 1
P =
    0.8647

Alternatively, the cumulative distribution function (cdf) for the exponential function, expcdf, can be used:

P = expcdf(1,1/lambda) % Cumulative probability from 0 to 1
P =
    0.8647

Estimating PDFs without Parameters

A distribution of data can be described graphically with a histogram:

cars = load('carsmall','MPG','Origin'); 
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])

You can also describe a data distribution by estimating its density. The ksdensity function does this using a kernel smoothing method. A nonparametric density estimate of the previous data, using the default kernel and bandwidth, is given by:

[f,x] = ksdensity(MPG); 
plot(x,f); 
title('Density estimate for MPG') 

Controlling Probability Density Curve Smoothness.   The choice of kernel bandwidth controls the smoothness of the probability density curve. The following graph shows the density estimate for the same mileage data using different bandwidths. The default bandwidth is in blue and looks like the preceding graph. Estimates for smaller and larger bandwidths are in red and green.

The first call to ksdensity returns the default bandwidth, u, of the kernel smoothing function. Subsequent calls modify this bandwidth.

[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on

[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');

[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');

legend('default width','1/3 default','3*default')
hold off

The default bandwidth seems to be doing a good job—reasonably smooth, but not so smooth as to obscure features of the data. This bandwidth is the one that is theoretically optimal for estimating densities for the normal distribution.

The green curve shows a density with the kernel bandwidth set too high. This curve smooths out the data so much that the end result looks just like the kernel function. The red curve has a smaller bandwidth and is rougher looking than the blue curve. It may be too rough, but it does provide an indication that there might be two major peaks rather than the single peak of the blue curve. A reasonable choice of width might lead to a curve that is intermediate between the red and blue curves.

Specifying Kernel Smoothing Functions.   You can also specify a kernel function by supplying either the function name or a function handle. The four preselected functions, 'normal', 'epanechnikov', 'box', and 'triangle', are all scaled to have standard deviation equal to 1, so they perform a comparable degree of smoothing.

Using default bandwidths, you can now plot the same mileage data, using each of the available kernel functions.

hname = {'normal' 'epanechnikov' 'box' 'triangle'};
hold on;
colors = {'r' 'b' 'g' 'm'};
for j=1:4
    [f,x] = ksdensity(MPG,'kernel',hname{j});
    plot(x,f,colors{j});
end
legend(hname{:});
hold off

The density estimates are roughly comparable, but the box kernel produces a density that is rougher than the others.

Comparing Density Estimates.   While it is difficult to overlay two histograms to compare them, you can easily overlay smooth density estimates. For example, the following graph shows the MPG distributions for cars from different countries of origin:

Origin = cellstr(cars.Origin);

I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);

[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on

[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')

[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')

legend('USA','Japan','Europe')
hold off

For piecewise probability density estimation, using kernel smoothing in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.

Cumulative Distribution Functions

Estimating Parametric CDFs

Cumulative distribution functions (cdfs) for supported Statistics Toolbox distributions all end with cdf, as in binocdf or expcdf. Specific function names for specific distributions can be found in Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are arrays of outcomes followed by a list of parameter values specifying a particular member of the distribution family.

For discrete distributions, the cdf F is related to the pdf f by

For continuous distributions, the cdf F is related to the pdf f by

Cdfs are used to compute probabilities of events. In particular, if F is a cdf and x and y are outcomes, then

For example, the t-statistic

follows a Student's t distribution with n – 1 degrees of freedom when computed from repeated random samples from a normal population with mean μ. Here is the sample mean, s is the sample standard deviation, and n is the sample size. The probability of observing a t-statistic greater than or equal to the value computed from a sample can be found with the tcdf function:

mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
    0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
    0.4020

This probability is the same as the p-value returned by a t-test of the null hypothesis that the sample comes from a normal population with mean μ:

[h,ptest] = ttest(x,mu,0.05,'right')
h =
     0
ptest =
    0.4020

Estimating Empirical CDFs

The ksdensity function produces an empirical version of a probability density function (pdf). That is, instead of selecting a density with a particular parametric form and estimating the parameters, it produces a nonparametric density estimate that adapts itself to the data.

Similarly, it is possible to produce an empirical version of the cumulative distribution function (cdf). The ecdf function computes this empirical cdf. It returns the values of a function such that represents the proportion of observations in a sample less than or equal to .

The idea behind the empirical cdf is simple. It is a function that assigns probability to each of observations in a sample. Its graph has a stair-step appearance. If a sample comes from a distribution in a parametric family (such as a normal distribution), its empirical cdf is likely to resemble the parametric distribution. If not, its empirical distribution still gives an estimate of the cdf for the distribution that generated the data.

The following example generates 20 observations from a normal distribution with mean 10 and standard deviation 2. You can use ecdf to calculate the empirical cdf and stairs to plot it. Then you overlay the normal distribution curve on the empirical function.

x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);

stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)

The empirical cdf is especially useful in survival analysis applications. In such applications the data may be censored, that is, not observed exactly. Some individuals may fail during a study, and you can observe their failure time exactly. Other individuals may drop out of the study, or may not fail until after the study is complete. The ecdf function has arguments for dealing with censored data. In addition, you can use the coxphfit function with individuals that have predictors that are not the same.

For piecewise probability density estimation, using the empirical cdf in the center of the distribution and Pareto distributions in the tails, see Fitting Piecewise Distributions.

Inverse Cumulative Distribution Functions

Inverse cumulative distribution functions for supported Statistics Toolbox distributions all end with inv, as in binoinv or expinv. Specific function names for specific distributions can be found in Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are arrays of cumulative probabilities between 0 and 1 followed by a list of parameter values specifying a particular member of the distribution family.

For continuous distributions, the inverse cdf returns the unique outcome whose cdf value is the input cumulative probability.

For example, the expinv function can be used to compute inverses of exponential cumulative probabilities:

x = 0.5:0.2:1.5 % Outcomes
x =
    0.5000  0.7000  0.9000  1.1000  1.3000  1.5000
p = expcdf(x,1) % Cumulative probabilities
p =
    0.3935  0.5034  0.5934  0.6671  0.7275  0.7769
expinv(p,1) % Return original outcomes
ans =
    0.5000  0.7000  0.9000  1.1000  1.3000  1.5000

For discrete distributions, there may be no outcome whose cdf value is the input cumulative probability. In these cases, the inverse cdf returns the first outcome whose cdf value equals or exceeds the input cumulative probability.

For example, the binoinv function can be used to compute inverses of binomial cumulative probabilities:

x = 0.5:0.2:1.5 % Outcomes
x =
    0.5000  0.7000  0.9000  1.1000  1.3000  1.5000
p = binocdf(x,10,0.2) % Cumulative probabilities
p =
    0.1074  0.1074  0.1074  0.3758  0.3758  0.3758
>> binoinv(p,10,0.2) % Return binomial outcomes
ans =
     0  0  0  1  1  1

The inverse cdf is useful in hypothesis testing, where critical outcomes of a test statistic are computed from cumulative significance probabilities. For example, norminv can be used to compute a 95% confidence interval under the assumption of normal variability:

p = [0.025 0.975]; % Interval containing 95% of [0,1]
x = norminv(p,0,1) % Assume standard normal variability
x =
   -1.9600  1.9600 % 95% confidence interval

n = 20; % Sample size
y = normrnd(8,1,n,1); % Random sample (assume mean is unknown)
ybar = mean(y);
ci = ybar + (1/sqrt(n))*x % Confidence interval for mean
ci =
    7.6779  8.5544 

Distribution Statistics Functions

Distribution statistics functions for supported Statistics Toolbox distributions all end with stat, as in binostat or expstat. Specific function names for specific distributions can be found in Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family. Functions return the mean and variance of the distribution, as a function of the parameters.

For example, the wblstat function can be used to visualize the mean of the Weibull distribution as a function of its two distribution parameters:

a = 0.5:0.1:3;
b = 0.5:0.1:3;
[A,B] = meshgrid(a,b);
M = wblstat(A,B);
surfc(A,B,M)

Distribution Fitting Functions

Fitting Regular Distributions

Distribution fitting functions for supported Statistics Toolbox distributions all end with fit, as in binofit or expfit. Specific function names for specific distributions can be found in Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are arrays of data, presumed to be samples from some member of the selected distribution family. Functions return maximum likelihood estimates (MLEs) of distribution parameters, that is, parameters for the distribution family member with the maximum likelihood of producing the data as a random sample.

The Statistics Toolbox function mle is a convenient front end to the individual distribution fitting functions, and more. The function computes MLEs for distributions beyond those for which Statistics Toolbox software provides specific pdf functions.

For some pdfs, MLEs can be given in closed form and computed directly. For other pdfs, a search for the maximum likelihood must be employed. The search can be controlled with an options input argument, created using the statset function. For efficient searches, it is important to choose a reasonable distribution model and set appropriate convergence tolerances.

MLEs can be heavily biased, especially for small samples. As sample size increases, however, MLEs become unbiased minimum variance estimators with approximate normal distributions. This is used to compute confidence bounds for the estimates.

For example, consider the following distribution of means from repeated random samples of an exponential distribution:

mu = 1; % Population parameter
n = 1e3; % Sample size
ns = 1e4; % Number of samples
samples = exprnd(mu,n,ns); % Population samples
means = mean(samples); % Sample means

The Central Limit Theorem says that the means will be approximately normally distributed, regardless of the distribution of the data in the samples. The normfit function can be used to find the normal distribution that best fits the means:

[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
    1.0003
sigmahat =
    0.0319
muci =
    0.9997
    1.0010
sigmaci =
    0.0314
    0.0323

The function returns MLEs for the mean and standard deviation and their 95% confidence intervals.

To visualize the distribution of sample means together with the fitted normal distribution, you must scale the fitted pdf, with area = 1, to the area of the histogram being used to display the means:

numbins = 50;
hist(means,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(means,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = normpdf(x,muhat,sigmahat);
plot(x,histarea*y,'r','LineWidth',2)

Fitting Piecewise Distributions

The parametric methods discussed in Fitting Regular Distributions fit data samples with smooth distributions that have a relatively low-dimensional set of parameters controlling their shape. These methods work well in many cases, but there is no guarantee that a given sample will be described accurately by any of the supported Statistics Toolbox distributions.

The empirical distributions computed by ecdf and discussed in Estimating Empirical CDFs assign equal probability to each observation in a sample, providing an exact match of the sample distribution. However, the distributions are not smooth, especially in the tails where data may be sparse.

The paretotails function fits a distribution by piecing together the empirical distribution in the center of the sample with smooth generalized Pareto distributions (GPDs) in the tails. The output is an object of the paretotails class, with associated methods to evaluate the cdf, inverse cdf, and other functions of the fitted distribution.

As an example, consider the following data, with about 20% outliers:

left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];

Neither a normal distribution nor a t distribution fits the tails very well:

probplot(data);
p = fitdist(data,'tlocationscale');
h = probplot(gca,p); 
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Normal','Data','t','Location','NW')

On the other hand, the empirical distribution provides a perfect fit, but the outliers make the tails very discrete:

ecdf(data)

Random samples generated from this distribution by inversion might include, for example, values around 4.33 and 9.25, but nothing in-between.

The paretotails function provides a single, well-fit model for the entire sample. The following uses generalized Pareto distributions (GPDs) for the lower and upper 10% of the data:

pfit = paretotails(data,0.1,0.9)
pfit = 
Piecewise distribution with 3 segments
 -Inf < x < -1.30726 (0 < p < 0.1)
        lower tail, GPD(-1.10167,1.12395)

 -1.30726 < x < 1.27213 (0.1 < p < 0.9)
        interpolated empirical cdf

  1.27213 < x < Inf (0.9 < p < 1)
        upper tail, GPD(1.03844,0.726038)

x = -4:0.01:10;
plot(x,cdf(pfit,x))

Access information about the fit using the methods of the paretotails class. Options allow for nonparametric estimation of the center of the cdf.

Negative Log-Likelihood Functions

Negative log-likelihood functions for supported Statistics Toolbox distributions all end with like, as in explike. Specific function names for specific distributions can be found in Supported Distributions.

Each function represents a parametric family of distributions. Input arguments are lists of parameter values specifying a particular member of the distribution family followed by an array of data. Functions return the negative log-likelihood of the parameters, given the data.

Negative log-likelihood functions are used as objective functions in search algorithms such as the one implemented by the MATLAB function fminsearch. Additional search algorithms are implemented by Optimization Toolbox™ functions and Genetic Algorithm and Direct Search Toolbox™ functions.

When used to compute maximum likelihood estimates (MLEs), negative log-likelihood functions allow you to choose a search algorithm and exercise low-level control over algorithm execution. By contrast, the functions discussed in Distribution Fitting Functions use preset algorithms with options limited to those set by the statset function.

Likelihoods are conditional probability densities. A parametric family of distributions is specified by its pdf f (x,a), where x and a represent the variables and parameters, respectively. When a is fixed, the pdf is used to compute the density at x, f (x|a). When x is fixed, the pdf is used to compute the likelihood of the parameters a, f (a|x). The joint likelihood of the parameters over an independent random sample X is

Given X, MLEs maximize L(a) over all possible a.

In numerical algorithms, the log-likelihood function, log(L(a)), is (equivalently) optimized. The logarithm transforms the product of potentially small likelihoods into a sum of logs, which is easier to distinguish from 0 in computation. For convenience, Statistics Toolbox negative log-likelihood functions return the negative of this sum, since the optimization algorithms to which the values are passed typically search for minima rather than maxima.

For example, use gamrnd to generate a random sample from a specific gamma distribution:

a = [1,2];
X = gamrnd(a(1),a(2),1e3,1);

Given X, the gamlike function can be used to visualize the likelihood surface in the neighborhood of a:

mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
    for j = 1:mesh
        logL(i,j) = gamlike([a1(i),a2(j)],X);
    end
end
 
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)

The MATLAB function fminsearch is used to search for the minimum of the likelihood surface:

LL = @(u)gamlike([u(1),u(2)],X); % Likelihood given X
MLES = fminsearch(LL,[1,2])
MLES =
    1.0231    1.9729

These can be compared to the MLEs returned by the gamfit function, which uses a combination search and solve algorithm:

ahat = gamfit(X)
ahat =
    1.0231    1.9728

The MLEs can be added to the surface plot (rotated to show the minimum):

hold on
plot3(MLES(1),MLES(2),LL(MLES),...
      'ro','MarkerSize',5,...
      'MarkerFaceColor','r')

Random Number Generators

The Statistics Toolbox supports the generation of random numbers from various distributions. Each RNG represents a parametric family of distributions. RNGs return random numbers from the specified distribution in an array of the specified dimensions. Specific RNG names for specific distributions are in Supported Distributions.

Other random number generation functions which do not support specific distributions include:

RNGs in Statistics Toolbox software depend on MATLAB's default random number stream via the rand and randn functions, each RNG uses one of the techniques discussed in Common Generation Methods to generate random numbers from a given distribution.

By controlling the default random number stream and its state, you can control how the RNGs in Statistics Toolbox software generate random values. For example, to reproduce the same sequence of values from an RNG, you can save and restore the default stream's state, or reset the default stream. For details on managing the default random number stream, see Managing the Default Stream.

MATLAB initializes the default random number stream to the same state each time it starts up. Thus, RNGs in Statistics Toolbox software will generate the same sequence of values for each MATLAB session unless you modify that state at startup. One simple way to do that is to add commands to startup.m such as

stream = RandStream('mt19937ar','seed',sum(100*clock));
RandStream.setDefaultStream(stream);

that initialize MATLAB's default random number stream to a different state for each session.


Dependencies of the Random Number Generators

The following table lists the dependencies of Statistics Toolbox RNGs on the MATLAB base RNGs rand and/or randn.

RNGMATLAB Base RNG

betarnd

rand, randn

binornd

rand

chi2rnd

rand, randn

evrnd

rand

exprnd

rand

frnd

rand, randn

gamrnd

rand, randn

geornd

rand

gevrnd

rand

gprnd

rand

hygernd

rand

iwishrnd

rand, randn

johnsrnd

randn

lhsdesign

rand

lhsnorm

rand

lognrnd

randn

mhsample

rand or randn, depending on the RNG given for the proposal distribution

mvnrnd

randn

mvtrnd

rand, randn

nbinrnd

rand, randn

ncfrnd

rand, randn

nctrnd

rand, randn

ncx2rnd

randn

normrnd

randn

pearsrnd

rand or randn, depending on the distribution type

poissrnd

rand, randn

random

rand or randn, depending on the specified distribution

randsample

rand

raylrnd

randn

slicesample

rand

trnd

rand, randn

unidrnd

rand

unifrnd

rand

wblrnd

rand

wishrnd

rand, randn

  


Recommended Products

Includes the most popular MATLAB recorded presentations with Q&A sessions led by MATLAB experts.

 © 1984-2009- The MathWorks, Inc.    -   Site Help   -   Patents   -   Trademarks   -   Privacy Policy   -   Preventing Piracy   -   RSS