Code covered by the BSD License

4.9 | 26 ratings | 137 Downloads (last 30 days) | File Size: 5.3 KB | File ID: #34943 | Version: 1.4

# Fit all valid parametric probability distributions to data

### Mike Sheppard

06 Feb 2012 (Updated )

ALLFITDIST Fit all valid parametric probability distributions to data.

### Editor's Notes:

This file was selected as MATLAB Central Pick of the Week

File Information
Description

ALLFITDIST Fit all valid parametric probability distributions to data.
[D PD] = ALLFITDIST(DATA) fits all valid parametric probability distributions to the data in vector DATA. It returns a struct array D of fitted distributions and their parameters, and a cell array PD of objects representing the fitted distributions. Each element of PD is an object in a class derived from the ProbDist class.

[...] = ALLFITDIST(DATA,SORTBY) returns the struct of valid distributions sorted by the criterion SORTBY:
NLogL - Negative of the log likelihood
BIC - Bayesian information criterion (default)
AIC - Akaike information criterion
AICc - AIC with a correction for finite sample sizes
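
For reference, the four SORTBY criteria are usually defined as below. This is a sketch of the textbook definitions, not a transcription of the ALLFITDIST source: the symbols $k$ (number of fitted parameters), $n$ (number of observations) and $f(x_i \mid \hat\theta)$ (fitted density or probability mass) are notation introduced here for illustration.

```latex
\begin{align*}
\mathrm{NLogL} &= -\sum_{i=1}^{n} \log f(x_i \mid \hat\theta) \\
\mathrm{AIC}   &= 2k + 2\,\mathrm{NLogL} \\
\mathrm{AICc}  &= \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \\
\mathrm{BIC}   &= k \ln n + 2\,\mathrm{NLogL}
\end{align*}
```

Under each criterion a smaller value indicates a better fit. AIC, AICc and BIC add a penalty that grows with the number of parameters, while NLogL does not.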

[...] = ALLFITDIST(...,'DISCRETE') specifies that the data are discrete, so only discrete distributions are fitted and no continuous distribution is attempted.

[...] = ALLFITDIST(...,'PDF') or (...,'CDF') plots either the PDF or CDF of a subset of the fitted distributions. The distributions are plotted in order of fit, according to SORTBY.

List of distributions it will try to fit
Continuous (default)
Beta
Birnbaum-Saunders
Exponential
Extreme value
Gamma
Generalized extreme value
Generalized Pareto
Inverse Gaussian
Logistic
Log-logistic
Lognormal
Nakagami
Normal
Rayleigh
Rician
t location-scale
Weibull

Discrete ('DISCRETE')
Binomial
Negative binomial
Poisson
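
The fit-and-rank idea behind the list above can be sketched in a few lines. This is a minimal illustration, not the ALLFITDIST source: it tries only three candidates, and it assumes a recent Statistics and Machine Learning Toolbox release in which fitted objects provide `negloglik` and the `NumParameters` property. The variable names (`names`, `bic`, `order`) are made up for the example.

```matlab
% Minimal sketch (not the ALLFITDIST code): fit a few candidates, rank by BIC.
rng(0);                                   % reproducible sample
data  = normrnd(5,3,1e3,1);               % stand-in for data of unknown origin
names = {'Normal','Logistic','tLocationScale'};  % candidates valid on the real line
n     = numel(data);
bic   = zeros(size(names));
pd    = cell(size(names));
for k = 1:numel(names)
    pd{k}  = fitdist(data, names{k});     % maximum likelihood fit
    nll    = negloglik(pd{k});            % negative log likelihood at the fit
    p      = pd{k}.NumParameters;         % number of distribution parameters
    bic(k) = 2*nll + p*log(n);            % Bayesian information criterion
end
[bic, order] = sort(bic);                 % smallest BIC first
names = names(order);
pd    = pd(order);
for k = 1:numel(names)
    fprintf('%-16s BIC = %.1f\n', names{k}, bic(k));
end
```

After the loop, `pd{1}` holds the candidate with the smallest BIC, which mirrors how `PD{1}` is used in the examples.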

Optional inputs:
[...] = ALLFITDIST(...,'n',N,...)
For the 'binomial' distribution only:
'n' A positive integer specifying the N parameter (number of trials). Not allowed for other distributions. If 'n' is not given, it is estimated by the method of moments. If the estimated 'n' is negative, the maximum value of the data is used instead.
[...] = ALLFITDIST(...,'theta',THETA,...)
For the 'generalized pareto' distribution only:
'theta' The value of the THETA (threshold) parameter for the generalized Pareto distribution. Not allowed for other distributions. If 'theta' is not given it is estimated by the minimum value of the data.
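
The remark about a negative estimate of 'n' follows from the usual method-of-moments algebra for the binomial distribution, sketched below. The symbols $\bar{x}$ (sample mean) and $s^2$ (sample variance) are notation introduced here, and the exact expression used inside ALLFITDIST may differ.

```latex
\begin{align*}
\bar{x} &= n p, \qquad s^2 = n p (1 - p) \\
\hat{p} &= 1 - \frac{s^2}{\bar{x}} \\
\hat{n} &= \frac{\bar{x}}{\hat{p}} = \frac{\bar{x}^2}{\bar{x} - s^2}
\end{align*}
```

When the sample variance exceeds the sample mean (overdispersed data), $\hat{n}$ comes out negative, which is the case where the maximum of the data is used instead.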

Note: ALLFITDIST does not handle nonparametric kernel smoothing; use FITDIST directly instead.
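
A kernel-smoothing fit with FITDIST might look like the sketch below. The bimodal sample and the variable names are invented for illustration; only `fitdist(...,'Kernel')` and `pdf` are toolbox calls.

```matlab
% Sketch: nonparametric kernel-smoothing fit with FITDIST (not ALLFITDIST).
rng(0);                                            % reproducible sample
data = [normrnd(0,1,500,1); normrnd(6,1.5,500,1)]; % bimodal data, poorly served by one parametric family
pd   = fitdist(data,'Kernel');                     % kernel density fit, default settings
xi   = linspace(min(data),max(data),200);          % evaluation grid
yi   = pdf(pd,xi);                                 % smoothed density estimate on the grid
plot(xi,yi)
```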

EXAMPLE 1
Given random data from an unknown continuous distribution, find the best distribution which fits that data, and plot the PDFs to compare graphically.
data = normrnd(5,3,1e4,1); %Assumed from unknown distribution
[D PD] = allfitdist(data,'PDF'); %Compute and plot results
D(1) %Show output from best fit

EXAMPLE 2
Given random data from a discrete unknown distribution, with frequency data, find the best discrete distribution which would fit that data, sorted by 'NLogL', and plot the PDFs to compare graphically.
data = nbinrnd(20,.3,1e4,1);
values=unique(data); freq=histc(data,values);
[D PD] = allfitdist(values,'NLogL','frequency',freq,'PDF','DISCRETE');
PD{1}

EXAMPLE 3
Although the Geometric Distribution is not listed, it is a special case of fitting the more general Negative Binomial Distribution. The parameter 'r' should be close to 1. Show by example.
data=geornd(.7,1e4,1); %Random from Geometric
[D PD]= allfitdist(data,'PDF','DISCRETE');
PD{1}

EXAMPLE 4
Compare the resulting distributions under two different assumptions about discrete data. In the first, the data are known to come from a Binomial Distribution with known 'n'. In the second, the data may be Binomial, but 'n' is unknown and must be estimated. Note that the second scenario may not yield a Binomial Distribution as the best fit if 'n' is estimated incorrectly. (Run the example a few times to see the effect.)
data = binornd(10,.3,1e2,1);
[D1 PD1] = allfitdist(data,'n',10,'DISCRETE','PDF'); %Force binomial
[D2 PD2] = allfitdist(data,'DISCRETE','PDF'); %May be binomial
PD1{1}, PD2{1} %Compare distributions

Acknowledgements

This file inspired Kstest Plot(X,Cdf Model,Cdf Data).

Required Products: Statistics and Machine Learning Toolbox
MATLAB release: MATLAB 7.12 (R2011a)
Other requirements: Requires Statistics Toolbox
Comments and Ratings (40)
15 Jul 2016 Rita

### Rita

One question! Can I use this function if I have gaps in my data?
Thanks

Comment only
29 Jun 2016 Eva Brayfindley

### Eva Brayfindley

10 Apr 2016 KBundy

### KBundy

18 Mar 2016 Ueli Rutishauser

### Ueli Rutishauser

07 Oct 2015 Meysam Vadiati

### Meysam Vadiati

Hi Mike

Comment only
18 Aug 2015 Arvid Dujardin

### Arvid Dujardin

tafteh,

The Probability Density plot is actually scaled to the bin width (see line 349 in the code). I suppose this is done to obtain comparable values for the bar-plot (empirical) and the results of the fit (pdf-function on line 351).

I would like to know why the bar-plot is scaled to the bin width, rather than using the same bin width for bar- and pdf-plots.

Also, in the bar plot the maximum value of the data is never considered: on line 348 histc(data,xi-dx) is used rather than histc(data,xi). Why is this?

29 Jul 2015 tafteh

### tafteh

Hi Mike,

Thanks for your brilliant work on this script.

However, I came across one weird result:
The probability density function plot has a y-axis scaled from 0 to 2.5, and the peaks of the fitted distributions go as high as 2. Is that right?

I would appreciate any help,
Thanks,

21 Jul 2015 Danilo Gaspar

### Danilo Gaspar

Very useful script.

Hi Abdullahi Salman, to show the output results properly you should index the variable, as shown by the example, D(1).

23 Jun 2015 Abdullahi Salman

### Abdullahi Salman

Awesome script. However, I am having a little problem: D and PD are not outputting any results, although I can see the plot of the PDF. This is what I am getting:

D =
1x6 struct array with fields:
DistName
NLogL
BIC
AIC
AICc
ParamNames
ParamDescription
Params
Paramci
ParamCov
Support

PD =

Columns 1 through 5

[1x1 ProbDistUnivParam] [1x1 ProbDistUnivParam] [1x1 ProbDistUnivParam] [1x1 ProbDistUnivParam] [1x1 ProbDistUnivParam]

Column 6

[1x1 ProbDistUnivParam]

I will appreciate any help. Thank you.

16 Jun 2015 Anshul Goyal

### Anshul Goyal

07 Apr 2015 Vassilios Vonikakis

### Vassilios Vonikakis

very easy and direct to use

10 Mar 2015 John Knag

### John Knag

Super easy to use and very helpful. Thank you.

20 Feb 2015 Roudy DAGHER

### Roudy DAGHER

Hi Mike,

that's a very nice script.
It would be also useful to test against mixtures, for instance when the data can be fit to a mixture of two or more gaussians, with the parameter k increasing...
see fitgmdist Matlab function.

Best,
Roudy

20 Jan 2015 SANHANAT

### SANHANAT

best MATLAB code so far

22 Aug 2014 Alireza

### Alireza

For normally distributed data, the allfitdist function returns 'rayleigh' as the best-fit distribution! That is weird, since this is an example included in the file.

commands: data = normrnd(5,3,1e4,1); [D PD] = allfitdist(data,'PDF'); D(1)

output: ans =

DistName: 'rayleigh'
NLogL: 2.4515e+04 - 1.5959e+03i
BIC: 4.9038e+04 - 3.1919e+03i
AIC: 4.9031e+04 - 3.1919e+03i
AICc: 4.9031e+04 - 3.1919e+03i
ParamNames: {'B'}
ParamDescription: {'scale'}
Params: 4.1166
Paramci: [2x1 double]
ParamCov: 4.2366e-04
Support: [1x1 struct]

Comment only
02 Jul 2014 Nebitno

### Nebitno

28 Jan 2014 sonakis23 sonaki

### sonakis23 sonaki

Hi, I was wondering how I could plot the PDF, the CDF and the error graph. Any ideas?

Comment only
26 Nov 2013 debora

### debora

@Hernando

I have the same problem. You need to replace every ~ (line 245 and others) with a letter.

25 Nov 2013 Hernando

### Hernando

Well, I'm using R2009a, and when using the file I got this error:
??? Error: File: allfitdist.m Line: 245 Column: 11
Expression or statement is incorrect--possibly unbalanced (, {, or [.

[D PD] = allfitdist(data,'CCDF');
??? Error: File: allfitdist.m Line: 245 Column: 11
Expression or statement is incorrect--possibly unbalanced (, {, or [.
data = normrnd(5,3,1e4,1);
>> [D PD] = allfitdist(data,'CCDF');
??? Undefined function or method 'allfitdist' for input arguments of type 'double'.
Is there any restriction for the file?

Comment only
16 Sep 2013 Venkatesh

### Venkatesh

Very useful script

14 Aug 2013 Shebuti Rayana

### Shebuti Rayana

I am using MATLAB R2008a and trying to use this code, but it's not working. It reports that no distributions were found for Example 1. I checked my MATLAB version and it contains the Statistics Toolbox. What should I do now? Please help.

Comment only
09 Jul 2013 katmai46

### katmai46

Dear Mr. Sheppard,

I have been using your code to fit several datasets that I have, and I found it really useful. My question is (I am very new to MATLAB as well as statistics): how do you define the "best" distribution? Is it based on p-values of KSTest?
Thanks

22 Aug 2012 Manuel Kuhs

### Manuel Kuhs

Really appreciate your function, was doing this manually for a while!

I apologise in advance if this is an ignorant question, as I'm a very basic MatLab user.

Would it be possible to amend your script to take into account for situations in which you know some data is missing? The particular type I'm interested in is when I know that my data actually only represents e.g. the first 70% on the CDF.

I hope this question makes sense. I'm not even sure of the right terminology to use!

03 May 2012 Nitin

### Nitin

25 Mar 2012 Olga Petrik

### Olga Petrik

15 Mar 2012 Roni Peer

### Roni Peer

Great Job.
I've changed it a bit to suit my needs, and going to add a GUI to allow the user to fit just a specific distribution, or select some of them. ALL of them would be a default.
Thanks!

13 Mar 2012 Mike Sheppard

### Mike Sheppard

Hi Roni,

The "Best Fit" can be found by the output by either D(1) or PD{1}, depending on if you want a structure or ProbDist class object. You can use the class object directly in other statistical functions, such as:

p=cdf(PD{1},xvalue)

The reason for including all valid distributions is that depending on preferences of model selection or assumptions from the data the distribution that you may prefer to use may be the 2nd or even 3rd "best" from the output, or not given at all. This is especially true if the SORTBY values are close in value, or if a parameter in a given distribution is close to a simpler special case.

Example 3 is an example of the latter: should you use as a model the Negative Binomial Distribution with r=.98, or assume it is actually the simpler Geometric Distribution with r=1, which is not given as an output?

The error graph is displayed when 'CDF' is given as an input. You can change the number of distributions to include in the plot by adjusting the max_num_dist variable in the plotfigs subfunction.

Hope that helps,

-Mike

Comment only
13 Mar 2012 Roni Peer

### Roni Peer

Hi Mike,

Why not add a "Best Fit" output also?
For example, if the best distribution which represents this data is "Weibull", return it as another output.
This can be used to find "Best Fit" for this data, which can be really useful.
I would also add a summary graph, which shows the error on all types of distributions, and what was the best one.

Roni.

Comment only
07 Mar 2012 Eric Diaz

### Eric Diaz

Very useful indeed!

28 Feb 2012 Francesco Cosentino

### Francesco Cosentino

Hi Guys,

The problem at lines 247 etc. is resolved by replacing the tilde operator with any name for a variable that will remain unused. The problem that Olivier also noted, however, is due to the fact that the function fitdist is missing in MATLAB 7.7.

Regards
Francesco

Comment only
28 Feb 2012 Francesco Cosentino

### Francesco Cosentino

Hi people,

This script is not working on matlab 7.7.

Matlab recognises an error in the code at line 247. It says:

Parse error at ',': usage might be invalid matlab syntax
Parse error at ']': usage might be invalid matlab syntax

And the error is repeated for lines 249 249 251 253.

Is there any way of getting it working on 7.7???

Regards
Francesco

Comment only
15 Feb 2012 Mike Sheppard

### Mike Sheppard

Warwick, thanks for your note. I am updating the file a bit, and the functionality of custom distributions seems interesting.

If you like, you can e-mail me directly with your improved functionality and I can include it in the next update with acknowledgment.

Comment only
14 Feb 2012 Warwick

### Warwick

Mike, I am sorry and aghast about the rating. I actually meant to leave the rating blank. On further experiment, there seems to be no way to go back to a null rating once my cursor merely touches the rating banner of stars (using iMac and the beta R2012a) . Anyway, I was able to use the file to obtain sorted best-fit curves on the type of problems I have and even added custom dist.

14 Feb 2012 Mike Sheppard

### Mike Sheppard

Warwick, for a "potentially a very useful script" I'm sorry you felt it was only worth a rating of one. Do you have suggestions on how it can be improved? Constructive criticism or ways to improve the program/functionality are always welcome, but I did not see any in your comment, other than asking for specific help after giving it a poor rating.

Please re-read the help section; specifically Example 2.

Comment only
14 Feb 2012 Warwick

### Warwick

Mike, this is potentially a very useful script for me. How can I use it for this example problem? I have frequency data describing number of events against day number. Logically the day number must be an integer from 1.
E.g., for discrete days 1:10 the Yobs are [1099 478 263 159 99 64 41 28 18 12]. Exponential and Weibull are fair candidate distributions, and I have previously fitted these as curves using LS or weighted LS, but an MLE approach (i.e., using the negative log likelihood) would be much better as there can be a lot of noise in the tails. Thanks, Warwick

13 Feb 2012 Jiro Doke

### Jiro Doke

Olivier,

Do you have Statistics Toolbox? It's required to use this function.

Comment only
12 Feb 2012 Tony Dalton

### Tony Dalton

10 Feb 2012 Matthew

### Matthew

Great idea, good examples, functional code (style could be better).

10 Feb 2012 Olivier Planchon

### Olivier Planchon

Does not work on Matlab 7.7
(Or I misunderstood how to use it)

>> [D, PD] = allfitdist(randn(1000,1)) ;
??? Error using ==> allfitdist at 238
No distributions were found

Comment only
07 Feb 2012 Jonathan Sullivan

### Jonathan Sullivan

Updates

07 Feb 2012 1.1

Included error checking for NaNs in data set and/or frequency; and dimension mismatch between data and frequency

07 Feb 2012 1.2

Corrected y-axis labels

17 Feb 2012 1.3

Fixed frequency data with binomial; generalized pareto as special case; and cleaned up code

04 Apr 2012 1.4

Updated help section