View License

Download apps, toolboxes, and other File Exchange content using Add-On Explorer in MATLAB.

» Watch video

Highlights from
Kernel Density Estimator

4.6 | 32 ratings Rate this file 132 Downloads (last 30 days) File Size: 5.5 KB File ID: #14034 Version: 1.5
image thumbnail

Kernel Density Estimator


Zdravko Botev (view profile)


21 Feb 2007 (Updated )

Reliable and extremely fast kernel density estimator for one-dimensional data

| Watch this File

File Information

Reliable and extremely fast kernel density estimator for one-dimensional data;
        Gaussian kernel is assumed and the bandwidth is chosen automatically;
        Unlike many other implementations, this one is immune to problems
        caused by multimodal densities with widely separated modes (see example). The
        estimation does not deteriorate for multimodal densities, because we never assume
        a parametric model for the data (like those used in rules of thumb).
     data - a vector of data from which the density estimate is constructed;
          n - the number of mesh points used in the uniform discretization of the
                    interval [MIN, MAX]; n has to be a power of two; if n is not a power of two, then
                    n is rounded up to the next power of two, i.e., n is set to n=2^ceil(log2(n));
                    the default value of n is n=2^12;
MIN, MAX - defines the interval [MIN,MAX] on which the density estimate is constructed;
                    the default values of MIN and MAX are:
                    MIN=min(data)-Range/10 and MAX=max(data)+Range/10, where Range=max(data)-min(data);
     bandwidth - the optimal bandwidth (Gaussian kernel assumed);
          density - column vector of length 'n' with the values of the density
                    estimate at the grid points;
          xmesh - the grid over which the density estimate is computed;
                    - If no output is requested, then the code automatically plots a graph of
                    the density estimate.
          cdf - column vector of length 'n' with the values of the cdf
 Kernel density estimation via diffusion
 Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010)
Annals of Statistics, Volume 38, Number 5, pages 2916-2957
Example (run in command window):
           data=[randn(100,1);randn(100,1)*2+35 ;randn(100,1)+55];


This file inspired H Coefficient, Kernel Density Estimator For High Dimensions, and Sim Out Utils.

MATLAB release MATLAB 8.5 (R2015a)
MATLAB Search Path
Tags for This File   Please login to tag files.
Please login to add a comment or rating.
Comments and Ratings (47)
21 Jul 2016 Stefano Rognoni

11 Jul 2015 Nuno Fachada

12 Apr 2015 Rashid Mehmood

Hello, Every body !
i am new in matlab. i am estimating density of 100 data points but it return density of 128 * 128 matrix . how to sum up to get only density of desired 100 data points.

Comment only
12 Mar 2015 Arpad Rozsas

17 Aug 2014 Marc Lalancette

Thanks, very useful. Strangely I get very different results on Matlab 2011b and 2013b with the same data. On the recent version, the density distribution is more smooth and has a stronger tendency to not go to 0 at the ends of the distribution. I'm guessing this is due to changes in a Matlab function. Any ideas?

11 Aug 2014 Genevieve

23 May 2014 Aniket Deshmukh

29 Apr 2014 lefteris

Dear Botev,

my data does not have meaning on negative values, but constructing histograms using kde returns frequencies on negative values and even if I determine the lower limit of x on zero, it returns on zero a big value. (i expect my histogram to start like x^2).


Comment only
13 Nov 2013 Bharath Ramesh

Dear Botev,

I have encountered a problem with your implementation and seeking your help. The PDFs obtained using translated versions of the signal (image histogram, in this case) is not the same.

data = [23 23 23 22 22 22 21 22 23];
data = [53 53 53 52 52 52 51 52 53];
MIN = 0
MAX = 255
n = 256;


This gives a good uni-modal estimate, whereas the second one is incomprehensible.


Please take a look at the density plots in each case.

This might be a problem with the bandwidth estimation but I don't know how to solve it.

Any help is appreciated.


For e.g:

Comment only
17 Dec 2012 Nicolas Beuchat

Nicolas Beuchat (view profile)

Brilliant! Saves me a lot of computation time and I gain in precision :-)


30 Aug 2012 Jiarui

Jiarui (view profile)

30 Aug 2012 Jiarui

Jiarui (view profile)

Hi Steven. It is the integral of the pdf function should be 1. So if your x-interval is very small, then the y-value of the pdf function could be larger than 1. E.g. A uniform distribution on x=[0,0.01]. Then y need to be 100 to make the integral 1.

26 Jul 2012 Steven Millard

I am using Botev tools and do not understand why the density function has values greater than one. I am knew to KDE and don't understand this yet. I figue a density function is suppose to add up to 1 when you integrate it?


Comment only
05 Jun 2012 Jiang

Jiang (view profile)


23 Feb 2012 NGO NHAT KHOA

Hi all, I have a problem with pdf estimate that needs your help. I am asked to verify that the probability of rv Z (number of trials) given that there are exactly 2 successes is a negative binomial distribution. I generate random numbers many times and record the number of trials required for 2 successes everytime. Then, I find the frequency at which each values of Z occur (Ex: repeating 1000 times the experiment, I find that only 60 times that the number of trials is 10 for exact 2 successes to occur, hence the frequency is 60/1000 = 0.06). After that, I try to estimate the pdf of Z using Kernel and compare with the plot by using nbinpdf available in MATLAB but the result is so terrible. I'm thinking of using kde function but do not know how to use. I really aprreciate your help, please.

Comment only
27 Jan 2012 Amir

Amir (view profile)

16 Jan 2012 Gabriel

Zdravko's kernel density estimator works a lot more quicker than traditional methods although I am getting spurious artifacts due to too low a bandwidth selected of 0.02 (a third smaller than when i used another selector which minimised expected L2 loss between estimate and underlying). The latter bandwidth works smoothly but takes a bit longer. Also, I get negative densities at the outliers so I adjusted the minmax boundaries. Is there a way to alter the estimator to avoid this issue?

07 Jan 2012 Zhang

Zhang (view profile)

hi, it's a really a fast and robust script. I have a question about what the time complexity (in terms of data size n) is, namely O(n) or O(n^2)? Could someone provide some time complexity analysis ? Great thanks~

21 Jun 2011 Rory Staunton

Question: is there any way to incorporate observation weights? I have calculated a weight based on other considerations (measurement error and goodness of fit, e.g.) for each data point in my distribution and want to incorporate this into the density estimate. Thanks in advance.

Comment only
10 Apr 2011 Babak Abbasi

Babak Abbasi (view profile)

Thanks, It helps a lot.

14 Mar 2011 Mi Matthew

Thanks a lot!It works very well.

03 Feb 2011 Ekaterina Zaytseva

Exellent script. Very fast and efficient. I have a question is there a similar script for a m-dimentional data (with m>2)?

Comment only
25 Jan 2011 James Ford

My apologies... I think the 13 Jan 2011 update fixed that crash (the 100 length vector now works).

Comment only
25 Jan 2011 James Ford

This has been excellent in general. A few times it has crashed at line 57:


because "??? Error using ==> fzero at 283
The function values at the interval endpoints must differ in sign."

The data doesn't look obviously bad in these cases. A short example vector is [14.0534 13.2851 13.0951 13.1159 14.2221] (this is a shortened version of a length 100 vector that also crashed).

21 Jan 2011 Zdravko Botev

Zdravko Botev (view profile)

Due to numerical round-off error from the fft.m function, it is possible to get density values of -1.38e-018 (instead of 0) and cdf values slightly larger than 1.
If this is a problem, one can correct the output from kde by overwriting:

density=max(density,0); cdf=min(1,cdf);

Comment only
13 Jan 2011 dk

dk (view profile)

The author fixed the bug and it works without a problem. Good job!

05 Nov 2010 dk

dk (view profile)

The code crashes at line 57 when length(data) is small. e.g kde(rand(100,1)) or kde(randn(30,1)).

Comment only
15 Oct 2010 Zdravko Botev

Zdravko Botev (view profile)

Dear George, the kde function works as it should. There is no problem with the kde. What you call a problem is actually one of the main strengths of the routine.

By typing data = [d1;d1;d1;d1;d1;d2;d3];
you are creating DISCRETE data, because you create ties (the same values appear multiple times). For a truly continuous data, there can be no ties or repeated values!!!
If you have ties, then the data CANNOT be continuous be definition.

The kde.m CORRECTLY recognizes that the data you have provided is perfectly discrete and since discrete data does not need smoothing, the selected bandwidth should be zero. kde.m is the only routine I am aware of that does this correctly, every other routine fails this BASIC theoretical test.

Comment only
15 Oct 2010 george. holzwarth

Botev's kernel density estimator works admirably for me, except with weighted data, where the bandwidth selector "fails".
kde finds a bandwidth of about 0.6, which is reasonable.

Now weight the first Gaussian 5 times:
data = [d1;d1;d1;d1;d1;d2;d3];
kde now finds a bandwidth of 0.001, which is not reasonable.
Is there a way to enter weighted data sets or change the bandwidth estimator to avoid this problem?

07 Oct 2010 Ángel yustres

Extremely fast and easy to use.

09 Apr 2010 star

star (view profile)

Could someone provide me with a code for nonparametric bayesian density estimation using a dirichlet prior? I am stuck.

Comment only
12 Jan 2010 Ange

Ange (view profile)

Fantastic script - fast and easy to use!

08 Jan 2010 oluwole ogunkelu

can someone provide me with hierarchical token bucket(HTB) algorithm used to optimize bandwidth? kinda stucked

Comment only
07 Jan 2010 David

David (view profile)

i am using your above code and my data is plotting density values well over 1 (i.e. >500). I looked at your example

% Example:
% data=[randn(100,1);randn(100,1)*2+35 ;randn(100,1)+55];
% [bandwidth,density,xmesh]=kde(data,2^14,min(data)-5,max(data)+5);
% plot(xmesh,density)

but even then, sum(density) = 235.6368, which obviously is greater than 1. It should be 1 if it's a pdf right?

So, does your code generate a pdf? Or is scaled in some other way? If it is not a pdf, do you know how to convert it to a pdf? (do you just normalize by sum(density) ?)

01 Jun 2009 Dazhi Jiang

I think the new version just missed the heading line. Please check it. But still good job. Thank.

06 Feb 2009 Karin Petrini

25 Jan 2009 Michael Jordan

24 Jan 2009 red

red (view profile)

Yes, the method seems to scale the function so that it becomes a pdf. But my data do not represent a pdf. How can I modify the method so that it works for general (nondensity) estimation?

16 Dec 2008 Nathanael Yoder

Nathanael Yoder (view profile)

I was incorrect but there does seem to be a scale factor on the density functions

Comment only
16 Dec 2008 Nathanael Yoder

Nathanael Yoder (view profile)

Great code but I believe line 83 should be:
instead of:
in order to get an accurately scaled density function.

10 Oct 2008 Alban Notin

Thanks for sharing this code. However using Matlab 6.5R13 I had to debug it: inputs arguments I,a2,N not specified for function fixed_point (for example).

07 Oct 2008 DI YANG

Not Bad, But this program is only available for 1d data. But it is still useful some questions . In any way Thanks for sharing.

29 May 2008 Andreas H.

23 May 2008 rao chen

thanks a lot! it's good.

Comment only
04 Apr 2008 Adel Daas

21 Mar 2008 Tingju Zhu

Highly recommend this! Very fast and robust.

15 Mar 2007 Dirk Kroese

Check this out. Much better than the currently available density estimation procedures!

26 May 2009 1.1

updated the reference - now a journal paper submitted to the Annals of Statistics

28 Jun 2009 1.3

As pointed out by Dazhi Jiang in the comments section, the healine
"function [bandwidth,density,xmesh]=kde(data,n,MIN,MAX)"
is missing. This version corrects this editing mistake.

07 Mar 2010 1.4

-Published in the Annals of Statistics, 2010, see Section 5.
- works on old versions of Matlab without nested functions.
- plots a graph when no output is requested

13 Jan 2011 1.5

- the updated version provides additionally a cdf estimator as an output argument
- designed not to crash for small number of data, e.g., kde(rand(1,5))
- published reference updated

29 May 2015 1.5

bug fixes: 1) in some rare cases with small 'n', fzero used to fail; code now deals with these failures;
                 2) density output forced to be positive (may be small and negative due to round-off errors, confusing some users)

30 Dec 2015 1.5

corrected the title back to "kernel density estimator" ; updated reference

Contact us