## Documentation Center |

On this page… |
---|

Gaussian mixture models are formed by combining
multivariate normal density components In Statistics Toolbox™ software,
use the `gmdistribution` class
to fit data using an expectation maximization (EM)
algorithm, which assigns posterior probabilities to each component
density with respect to each observation. The fitting method uses
an iterative algorithm that converges to a local optimum. Clustering
using Gaussian mixture models is sometimes considered a soft clustering
method. The posterior probabilities for each point indicate that each
data point has some probability of belonging to each cluster.

For more information on clustering with Gaussian mixture models, see Gaussian Mixture Models. This section describes their creation.

**Specifying a Model. **Use the `gmdistribution` constructor
to create Gaussian mixture models with specified means, covariances,
and mixture proportions.

First, define the means, covariances, and mixture proportions.

MU = [1 2;-3 -5]; % Means SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]); % Covariances p = ones(1,2)/2; % Mixing proportions

Then, create an object of the `gmdistribution` class defining a two-component
mixture of bivariate Gaussian distributions:

obj = gmdistribution(MU,SIGMA,p);

Display properties of the object with the MATLAB^{®} function `fieldnames`:

properties = fieldnames(obj)

properties = 'NDimensions' 'DistName' 'NComponents' 'PComponents' 'mu' 'Sigma' 'NlogL' 'AIC' 'BIC' 'Converged' 'Iters' 'SharedCov' 'CovType' 'RegV'

The `gmdistribution` reference
page describes these properties. To access the value of a property,
use dot indexing. For example, access the dimensions of the object.

dimension = obj.NDimensions

dimension = 2

Access the distribution name.

name = obj.DistName

name = gaussian mixture distribution

Use the methods `pdf` and `cdf` to compute values and visualize the
object:

figure ezsurf(@(x,y)pdf(obj,[x y]),[-10 10],[-10 10])

figure ezsurf(@(x,y)cdf(obj,[x y]),[-10 10],[-10 10])

**Fitting a Model to Data. **You can also create Gaussian mixture models by fitting a parametric
model with a specified number of components to data. `fitgmdist` uses the syntax `obj
= fitgmdist(X,k)`, where `X` is a data matrix
and `k` is the specified number of components. Choosing
a suitable number of components `k` is essential
for creating a useful model of the data—too few components
fails to model the data accurately; too many components leads to an
over-fit model with singular covariance matrices.

The following example illustrates this approach.

First, create some data from a mixture of two bivariate Gaussian
distributions using the `mvnrnd` function:

```
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);
mvnrnd(MU2,SIGMA2,1000)];
figure
scatter(X(:,1),X(:,2),10,'.')
```

Next, fit a two-component Gaussian mixture model:

options = statset('Display','final'); obj = fitgmdist(X,2,'Options',options); hold on h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]); hold off

18 iterations, log-likelihood = -7058.35

Among the properties of the fit are the parameter estimates.

Display the estimates for mu, sigma, and mixture proportions

ComponentMeans = obj.mu ComponentCovariances = obj.Sigma MixtureProportions = obj.PComponents

ComponentMeans = -2.9617 -4.9727 0.9539 2.0261 ComponentCovariances(:,:,1) = 1.0100 0.0059 0.0059 0.9897 ComponentCovariances(:,:,2) = 1.9939 -0.0092 -0.0092 0.4981 MixtureProportions = 0.5000 0.5000

The two-component model minimizes the Akaike information:

AIC = zeros(1,4); obj = cell(1,4); for k = 1:4 obj{k} = fitgmdist(X,k); AIC(k)= obj{k}.AIC; end [minAIC,numComponents] = min(AIC); numComponents

numComponents = 2

Display the model.

model = obj{2}

model = Gaussian mixture distribution with 2 components in 2 dimensions Component 1: Mixing proportion: 0.500000 Mean: -2.9617 -4.9727 Component 2: Mixing proportion: 0.500000 Mean: 0.9539 2.0261

Both the Akaike and Bayes information are negative log-likelihoods for the data with penalty terms for the number of estimated parameters. You can use them to determine an appropriate number of components for a model when the number of components is unspecified.

Use the method `random` of the `gmdistribution` class to generate random data
from a Gaussian mixture model created with `gmdistribution` or `fitgmdist`.

For example, the following specifies a `gmdistribution` object
consisting of a two-component mixture of bivariate Gaussian distributions:

MU = [1 2;-3 -5]; SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]); p = ones(1,2)/2; obj = gmdistribution(MU,SIGMA,p);

```
figure
ezcontour(@(x,y)pdf(obj,[x y]),[-10 10],[-10 10])
hold on
```

Use `random(gmdistribution)` to generate
1000 random values:

```
Y = random(obj,1000);
scatter(Y(:,1),Y(:,2),10,'.')
```

*Copulas* are functions
that describe dependencies among variables, and provide a way to create
distributions that model correlated multivariate data. Using a copula,
you can construct a multivariate distribution by specifying marginal
univariate distributions, and then choose a copula to provide a correlation
structure between variables. Bivariate distributions, as well as distributions
in higher dimensions, are possible.

One of the design decisions for a Monte Carlo simulation is a choice of probability distributions for the random inputs. Selecting a distribution for each individual variable is often straightforward, but deciding what dependencies should exist between the inputs may not be. Ideally, input data to a simulation should reflect what you know about dependence among the real quantities you are modeling. However, there may be little or no information on which to base any dependence in the simulation. In such cases, it is useful to experiment with different possibilities in order to determine the model's sensitivity.

It can be difficult to generate random inputs with dependence when they have distributions that are not from a standard multivariate distribution. Further, some of the standard multivariate distributions can model only limited types of dependence. It is always possible to make the inputs independent, and while that is a simple choice, it is not always sensible and can lead to the wrong conclusions.

For example, a Monte-Carlo simulation of financial risk could have two random inputs that represent different sources of insurance losses. You could model these inputs as lognormal random variables. A reasonable question to ask is how dependence between these two inputs affects the results of the simulation. Indeed, you might know from real data that the same random conditions affect both sources; ignoring that in the simulation could lead to the wrong conclusions.

**Generate and Exponentiate Normal Random Variables. **

The `lognrnd` function simulates independent lognormal random variables. In the following example, the `mvnrnd` function generates `n` pairs of independent normal random variables, and then exponentiates them. Notice that the covariance matrix used here is diagonal.

n = 1000; sigma = .5; SigmaInd = sigma.^2 .* [1 0; 0 1] rng('default'); % For reproducibility ZInd = mvnrnd([0 0],SigmaInd,n); XInd = exp(ZInd); plot(XInd(:,1),XInd(:,2),'.') axis([0 5 0 5]) axis equal xlabel('X1') ylabel('X2')

SigmaInd = 0.2500 0 0 0.2500

Dependent bivariate lognormal random variables are also easy to generate using a covariance matrix with nonzero off-diagonal terms.

rho = .7; SigmaDep = sigma.^2 .* [1 rho; rho 1] ZDep = mvnrnd([0 0],SigmaDep,n); XDep = exp(ZDep);

SigmaDep = 0.2500 0.1750 0.1750 0.2500

A second scatter plot demonstrates the difference between these two bivariate distributions.

plot(XDep(:,1),XDep(:,2),'.') axis([0 5 0 5]) axis equal xlabel('X1') ylabel('X2')

It is clear that there is a tendency in the second data set for large values of `X1` to be associated with large values of `X2`, and similarly for small values. The correlation parameter
of the underlying bivariate normal determines this dependence. The conclusions drawn from the simulation could well depend on whether you generate `X1` and `X2` with dependence. The bivariate lognormal distribution is a simple solution in this case; it easily generalizes to higher dimensions in cases where the marginal distributions are different lognormals.

Other multivariate distributions also exist. For example, the multivariate *t* and the Dirichlet distributions simulate dependent *t* and beta random variables, respectively. But the list of simple multivariate distributions is not long, and they only apply in cases where the marginals are all in the same family (or even the exact same distributions). This can be a serious limitation in many situations.

Although the construction discussed in the previous section creates a bivariate lognormal that is simple, it serves to illustrate a method that is more generally applicable.

Generate pairs of values from a bivariate normal distribution. There is statistical dependence between these two variables, and each has a normal marginal distribution.

Apply a transformation (the exponential function) separately to each variable, changing the marginal distributions into lognormals. The transformed variables still have a statistical dependence.

If a suitable transformation can be found, this method can be generalized to create dependent bivariate random vectors with other marginal distributions. In fact, a general method of constructing such a transformation does exist, although it is not as simple as exponentiation alone.

By definition, applying the normal cumulative distribution function
(cdf), denoted here by Φ, to a standard normal random variable
results in a random variable that is uniform on the interval [0,1].
To see this, if *Z* has a standard normal
distribution, then the cdf of *U* = Φ(*Z*)
is

and that is the cdf of a Unif(0,1) random variable. Histograms of some simulated normal and transformed values demonstrate that fact:

n = 1000; rng('default') % for reproducibility z = normrnd(0,1,n,1); % generate standard normal data hist(z,-3.75:.5:3.75) % plot the histogram of data xlim([-4 4]) title('1000 Simulated N(0,1) Random Values') xlabel('Z') ylabel('Frequency') set(get(gca,'Children'),'FaceColor',[.8 .8 1])

u = normcdf(z); % compute the cdf values of the sample data figure hist(u,.05:.1:.95) % plot the histogram of the cdf values title('1000 Simulated N(0,1) Values Transformed to Unif(0,1)') xlabel('U') ylabel('Frequency') set(get(gca,'Children'),'FaceColor',[.8 .8 1])

Borrowing from the theory of univariate random number generation,
applying the inverse cdf of any distribution, *F*,
to a Unif(0,1) random variable results in a random variable whose
distribution is exactly *F* (see Inversion Methods). The proof is essentially
the opposite of the preceding proof for the forward case. Another
histogram illustrates the transformation to a gamma distribution:

x = gaminv(u,2,1); % transform to gamma values figure hist(x,.25:.5:9.75) % plot the histogram of gamma values title('1000 Simulated N(0,1) Values Transformed to Gamma(2,1)') xlabel('X') ylabel('Frequency') set(get(gca,'Children'),'FaceColor',[.8 .8 1])

You can apply this two-step transformation to each variable of a standard bivariate normal, creating dependent random variables with arbitrary marginal distributions. Because the transformation works on each component separately, the two resulting random variables need not even have the same marginal distributions. The transformation is defined as:

where *G*_{1} and *G*_{2} are
inverse cdfs of two possibly different distributions. For example,
the following generates random vectors from a bivariate distribution
with *t*_{5} and
Gamma(2,1) marginals:

n = 1000; rho = .7; Z = mvnrnd([0 0],[1 rho; rho 1],n); U = normcdf(Z); X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)]; % draw the scatter plot of data with histograms figure scatterhist(X(:,1),X(:,2),'Direction','out')

This plot has histograms alongside a scatter plot to show both the marginal distributions, and the dependence.

The correlation parameter, *ρ*,
of the underlying bivariate normal determines the dependence between `X1` and `X2` in
this construction. However, the linear correlation of `X1` and `X2` is
not *ρ*. For example, in the original
lognormal case, a closed form for that correlation is:

which is strictly less than *ρ*,
unless *ρ* is exactly 1. In more
general cases such as the Gamma/*t* construction,
the linear correlation between `X1` and `X2` is
difficult or impossible to express in terms of *ρ*,
but simulations show that the same effect happens.

That is because the linear correlation coefficient expresses
the linear dependence between random variables, and when nonlinear
transformations are applied to those random variables, linear correlation
is not preserved. Instead, a rank correlation coefficient, such as
Kendall's τ or Spearman's *ρ*,
is more appropriate.

Roughly speaking, these rank correlations measure the degree
to which large or small values of one random variable associate with
large or small values of another. However, unlike the linear correlation
coefficient, they measure the association only in terms of ranks.
As a consequence, the rank correlation is preserved under any monotonic
transformation. In particular, the transformation method just described
preserves the rank correlation. Therefore, knowing the rank correlation
of the bivariate normal *Z* exactly determines
the rank correlation of the final transformed random variables, *X*.
While the linear correlation coefficient, *ρ*,
is still needed to parameterize the underlying bivariate normal, Kendall's *τ* or
Spearman's *ρ* are more useful in
describing the dependence between random variables, because they are
invariant to the choice of marginal distribution.

For the bivariate normal, there is a simple one-to-one mapping
between Kendall's *τ* or Spearman's *ρ*,
and the linear correlation coefficient *ρ*:

The following plot shows the relationship.

rho = -1:.01:1; tau = 2.*asin(rho)./pi; rho_s = 6.*asin(rho./2)./pi; plot(rho,tau,'b-','LineWidth',2) hold on plot(rho,rho_s,'g-','LineWidth',2) plot([-1 1],[-1 1],'k:','LineWidth',2) axis([-1 1 -1 1]) xlabel('rho') ylabel('Rank correlation coefficient') legend('Kendall''s {\it\tau}', ... 'Spearman''s {\it\rho_s}', ... 'location','NW')

Thus, it is easy to create the desired rank correlation between `X1` and `X2`,
regardless of their marginal distributions, by choosing the correct *ρ* parameter
value for the linear correlation between `Z1` and `Z2`.

For the multivariate normal distribution, Spearman's rank correlation is almost identical to the linear correlation. However, this is not true once you transform to the final random variables.

The first step of the construction described in the previous
section defines what is known as a bivariate Gaussian copula. A copula
is a multivariate probability distribution, where each random variable
has a uniform marginal distribution on the unit interval [0,1]. These
variables may be completely independent, deterministically related
(e.g., `U2 = U1`), or anything in between.
Because of the possibility for dependence among variables, you can
use a copula to construct a new multivariate distribution for dependent
variables. By transforming each of the variables in the copula separately
using the inversion method, possibly using different cdfs, the resulting
distribution can have arbitrary marginal distributions. Such multivariate
distributions are often useful in simulations, when you know that
the different random inputs are not independent of each other.

Statistics Toolbox functions compute:

Probability density functions (

`copulapdf`) and the cumulative distribution functions (`copulacdf`) for Gaussian copulasRank correlations from linear correlations (

`copulastat`) and vice versa (`copulaparam`)Random vectors (

`copularnd`)Parameters for copulas fit to data (

`copulafit`)

For example, use the `copularnd` function
to create scatter plots of random values from a bivariate Gaussian
copula for various levels of *ρ*,
to illustrate the range of different dependence structures. The family
of bivariate Gaussian copulas is parameterized by the linear correlation
matrix:

`U1` and `U2` approach
linear dependence as *ρ* approaches
±1, and approach complete independence as ρ approaches
zero:

n = 500; rng('default') % for reproducibility U = copularnd('Gaussian',[1 .8; .8 1],n); subplot(2,2,1) plot(U(:,1),U(:,2),'.') title('{\it\rho} = 0.8') xlabel('U1') ylabel('U2') U = copularnd('Gaussian',[1 .1; .1 1],n); subplot(2,2,2) plot(U(:,1),U(:,2),'.') title('{\it\rho} = 0.1') xlabel('U1') ylabel('U2') U = copularnd('Gaussian',[1 -.1; -.1 1],n); subplot(2,2,3) plot(U(:,1),U(:,2),'.') title('{\it\rho} = -0.1') xlabel('U1') ylabel('U2') U = copularnd('Gaussian',[1 -.8; -.8 1],n); subplot(2,2,4) plot(U(:,1),U(:,2),'.') title('{\it\rho} = -0.8') xlabel('U1') ylabel('U2')

The dependence between `U1` and `U2` is
completely separate from the marginal distributions of `X1
= G(U1)` and `X2 = G(U2)`. `X1` and `X2` can
be given any marginal distributions, and still have the same rank
correlation. This is one of the main appeals of copulas—they
allow this separate specification of dependence and marginal distribution.
You can also compute the pdf (`copulapdf`)
and the cdf (`copulacdf`) for a
copula. For example, these plots show the pdf and cdf for *ρ* =
.8:

u1 = linspace(1e-3,1-1e-3,50); u2 = linspace(1e-3,1-1e-3,50); [U1,U2] = meshgrid(u1,u2); Rho = [1 .8; .8 1]; f = copulapdf('t',[U1(:) U2(:)],Rho,5); f = reshape(f,size(U1)); figure() surf(u1,u2,log(f),'FaceColor','interp','EdgeColor','none') view([-15,20]) xlabel('U1') ylabel('U2') zlabel('Probability Density')

u1 = linspace(1e-3,1-1e-3,50); u2 = linspace(1e-3,1-1e-3,50); [U1,U2] = meshgrid(u1,u2); F = copulacdf('t',[U1(:) U2(:)],Rho,5); F = reshape(F,size(U1)); figure() surf(u1,u2,F,'FaceColor','interp','EdgeColor','none') view([-15,20]) xlabel('U1') ylabel('U2') zlabel('Cumulative Probability')

A different family of copulas can be constructed by starting
from a bivariate *t* distribution and
transforming using the corresponding *t* cdf.
The bivariate *t* distribution is parameterized
with *P*, the linear correlation matrix,
and *ν*, the degrees of freedom.
Thus, for example, you can speak of a *t*_{1} or
a *t*_{5} copula,
based on the multivariate *t* with one
and five degrees of freedom, respectively.

Just as for Gaussian copulas, Statistics Toolbox functions
for *t* copulas compute:

Probability density functions (

`copulapdf`) and the cumulative distribution functions (`copulacdf`) for Gaussian copulasRank correlations from linear correlations (

`copulastat`) and vice versa (`copulaparam`)Random vectors (

`copularnd`)Parameters for copulas fit to data (

`copulafit`)

For example, use the `copularnd` function
to create scatter plots of random values from a bivariate *t*_{1} copula
for various levels of *ρ*, to illustrate
the range of different dependence structures:

n = 500; nu = 1; rng('default') % for reproducibility U = copularnd('t',[1 .8; .8 1],nu,n); subplot(2,2,1) plot(U(:,1),U(:,2),'.') title('{\it\rho} = 0.8') xlabel('U1') ylabel('U2') U = copularnd('t',[1 .1; .1 1],nu,n); subplot(2,2,2) plot(U(:,1),U(:,2),'.') title('{\it\rho} = 0.1') xlabel('U1') ylabel('U2') U = copularnd('t',[1 -.1; -.1 1],nu,n); subplot(2,2,3) plot(U(:,1),U(:,2),'.') title('{\it\rho} = -0.1') xlabel('U1') ylabel('U2') U = copularnd('t',[1 -.8; -.8 1],nu, n); subplot(2,2,4) plot(U(:,1),U(:,2),'.') title('{\it\rho} = -0.8') xlabel('U1') ylabel('U2')

A *t* copula has uniform marginal
distributions for `U1` and `U2`,
just as a Gaussian copula does. The rank correlation *τ* or *ρ*_{s} between
components in a *t* copula is also the
same function of *ρ* as for a Gaussian.
However, as these plots demonstrate, a *t*_{1} copula
differs quite a bit from a Gaussian copula, even when their components
have the same rank correlation. The difference is in their dependence
structure. Not surprisingly, as the degrees of freedom parameter *ν* is
made larger, a *t*_{ν} copula
approaches the corresponding Gaussian copula.

As with a Gaussian copula, any marginal distributions can be
imposed over a *t* copula. For example,
using a *t* copula with 1 degree of freedom,
you can again generate random vectors from a bivariate distribution
with Gamma(2,1) and *t*_{5} marginals
using `copularnd`:

n = 1000; rho = .7; nu = 1; rng('default') % for reproducibility U = copularnd('t',[1 rho; rho 1],nu,n); X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)]; figure() scatterhist(X(:,1),X(:,2),'Direction','out')

Compared to the bivariate Gamma/*t* distribution
constructed earlier, which was based on a Gaussian copula, the distribution
constructed here, based on a *t*_{1} copula,
has the same marginal distributions and the same rank correlation
between variables but a very different dependence structure. This
illustrates the fact that multivariate distributions are not uniquely
defined by their marginal distributions, or by their correlations.
The choice of a particular copula in an application may be based on
actual observed data, or different copulas may be used as a way of
determining the sensitivity of simulation results to the input distribution.

The Gaussian and *t* copulas are
known as elliptical copulas. It is easy to generalize elliptical copulas
to a higher number of dimensions. For example, simulate data from
a trivariate distribution with Gamma(2,1), Beta(2,2), and *t*_{5} marginals
using a Gaussian copula and `copularnd`,
as follows:

n = 1000; Rho = [1 .4 .2; .4 1 -.8; .2 -.8 1]; rng('default') % for reproducibility U = copularnd('Gaussian',Rho,n); X = [gaminv(U(:,1),2,1) betainv(U(:,2),2,2) tinv(U(:,3),5)];

Plot the data.

subplot(1,1,1) plot3(X(:,1),X(:,2),X(:,3),'.') grid on view([-55, 15]) xlabel('X1') ylabel('X2') zlabel('X3')

Notice that the relationship between the linear correlation
parameter *ρ* and, for example,
Kendall's *τ*, holds for each entry
in the correlation matrix *P* used here.
You can verify that the sample rank correlations of the data are approximately
equal to the theoretical values:

tauTheoretical = 2.*asin(Rho)./pi

tauTheoretical = 1.0000 0.2620 0.1282 0.2620 1.0000 -0.5903 0.1282 -0.5903 1.0000

tauSample = corr(X,'type','Kendall')

tauSample = 1.0000 0.2581 0.1414 0.2581 1.0000 -0.5790 0.1414 -0.5790 1.0000

Statistics Toolbox functions are available for three bivariate Archimedean copula families:

Clayton copulas

Frank copulas

Gumbel copulas

These are one-parameter families that are defined directly in terms of their cdfs, rather than being defined constructively using a standard multivariate distribution.

To compare these three Archimedean copulas to the Gaussian and *t* bivariate
copulas, first use the `copulastat` function
to find the rank correlation for a Gaussian or *t* copula
with linear correlation parameter of 0.8, and then use the `copulaparam` function to find the Clayton
copula parameter that corresponds to that rank correlation:

tau = copulastat('Gaussian',.8 ,'type','kendall')

tau = 0.5903

alpha = copulaparam('Clayton',tau,'type','kendall')

alpha = 2.8820

Finally, plot a random sample from the Clayton copula with `copularnd`. Repeat the same procedure
for the Frank and Gumbel copulas:

n = 500; U = copularnd('Clayton',alpha,n); subplot(3,1,1) plot(U(:,1),U(:,2),'.'); title(['Clayton Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)]) xlabel('U1') ylabel('U2') alpha = copulaparam('Frank',tau,'type','kendall'); U = copularnd('Frank',alpha,n); subplot(3,1,2) plot(U(:,1),U(:,2),'.') title(['Frank Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)]) xlabel('U1') ylabel('U2') alpha = copulaparam('Gumbel',tau,'type','kendall'); U = copularnd('Gumbel',alpha,n); subplot(3,1,3) plot(U(:,1),U(:,2),'.') title(['Gumbel Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)]) xlabel('U1') ylabel('U2')

To simulate dependent multivariate data using a copula, you must specify each of the following:

The copula family (and any shape parameters)

The rank correlations among variables

Marginal distributions for each variable

Suppose you have return data for two stocks and want to run a Monte Carlo simulation with inputs that follow the same distributions as the data:

load stockreturns nobs = size(stocks,1); subplot(2,1,1) hist(stocks(:,1),10) xlim([-3.5 3.5]) xlabel('X1') ylabel('Frequency') set(get(gca,'Children'),'FaceColor',[.8 .8 1]) subplot(2,1,2) hist(stocks(:,2),10) xlim([-3.5 3.5]) xlabel('X2') ylabel('Frequency') set(get(gca,'Children'),'FaceColor',[.8 .8 1])

You could fit a parametric model separately to each dataset, and use those estimates as the marginal distributions. However, a parametric model may not be sufficiently flexible. Instead, you can use a nonparametric model to transform to the marginal distributions. All that is needed is a way to compute the inverse cdf for the nonparametric model.

The simplest nonparametric model is the empirical cdf, as computed
by the `ecdf` function.
For a discrete marginal distribution, this is appropriate. However,
for a continuous distribution, use a model that is smoother than the
step function computed by `ecdf`. One
way to do that is to estimate the empirical cdf and interpolate between
the midpoints of the steps with a piecewise linear function. Another
way is to use kernel smoothing with `ksdensity`.
For example, compare the empirical cdf to a kernel smoothed cdf estimate
for the first variable:

[Fi,xi] = ecdf(stocks(:,1)); figure() stairs(xi,Fi,'b','LineWidth',2) hold on Fi_sm = ksdensity(stocks(:,1),xi,'function','cdf','width',.15); plot(xi,Fi_sm,'r-','LineWidth',1.5) xlabel('X1') ylabel('Cumulative Probability') legend('Empirical','Smoothed','Location','NW') grid on

For the simulation, experiment with different copulas and correlations.
Here, you will use a bivariate *t* copula
with a fairly small degrees of freedom parameter. For the correlation
parameter, you can compute the rank correlation of the data.

nu = 5; tau = corr(stocks(:,1),stocks(:,2),'type','kendall')

tau = 0.5180

Find the corresponding linear correlation parameter for the *t* copula
using `copulaparam`.

rho = copulaparam('t', tau, nu, 'type','kendall')

rho = 0.7268

Next, use `copularnd` to
generate random values from the *t* copula and
transform using the nonparametric inverse cdfs. The `ksdensity` function allows you to make a
kernel estimate of distribution and evaluate the inverse cdf at the
copula points all in one step:

```
n = 1000;
U = copularnd('t',[1 rho; rho 1],nu,n);
```

X1 = ksdensity(stocks(:,1),U(:,1),... 'function','icdf','width',.15); X2 = ksdensity(stocks(:,2),U(:,2),... 'function','icdf','width',.15);

Alternatively, when you have a large amount of data or need to simulate more than one set of values, it may be more efficient to compute the inverse cdf over a grid of values in the interval (0,1) and use interpolation to evaluate it at the copula points:

p = linspace(0.00001,0.99999,1000); G1 = ksdensity(stocks(:,1),p,'function','icdf','width',0.15); X1 = interp1(p,G1,U(:,1),'spline'); G2 = ksdensity(stocks(:,2),p,'function','icdf','width',0.15); X2 = interp1(p,G2,U(:,2),'spline'); scatterhist(X1,X2,'Direction','out')

The marginal histograms of the simulated data are a smoothed
version of the histograms for the original data. The amount of smoothing
is controlled by the bandwidth input to `ksdensity`.

This example shows how to use `copulafit` to calibrate copulas with data. To generate data `Xsim` with a distribution "just like" (in terms of marginal distributions and correlations) the distribution of data in the matrix `X` , you need to fit marginal distributions to the columns of `X` , use appropriate cdf functions to transform `X` to `U` , so that `U` has values between 0 and 1, use `copulafit` to fit a copula to `U` , generate new data `Usim` from the copula, and use appropriate inverse cdf functions to transform `Usim` to `Xsim` .

Load and plot the simulated stock return data.

load stockreturns x = stocks(:,1); y = stocks(:,2); scatterhist(x,y,'Direction','out')

Transform the data to the copula scale (unit square) using a kernel estimator of the cumulative distribution function.

u = ksdensity(x,x,'function','cdf'); v = ksdensity(y,y,'function','cdf'); scatterhist(u,v,'Direction','out') xlabel('u') ylabel('v')

Fit a *t* copula.

[Rho,nu] = copulafit('t',[u v],'Method','ApproximateML')

Rho = 1.0000 0.7220 0.7220 1.0000 nu = 3.3004e+06

Generate a random sample from the *t* copula.

r = copularnd('t',Rho,nu,1000); u1 = r(:,1); v1 = r(:,2); scatterhist(u1,v1,'Direction','out') xlabel('u') ylabel('v') set(get(gca,'children'),'marker','.')

Transform the random sample back to the original scale of the data.

x1 = ksdensity(x,u1,'function','icdf'); y1 = ksdensity(y,v1,'function','icdf'); scatterhist(x1,y1,'Direction','out') set(get(gca,'children'),'marker','.')

As the example illustrates, copulas integrate naturally with other distribution fitting functions.

Was this topic helpful?