# Help with probability density functions

Jules Ray on 25 Feb 2019
Commented: Jeff Miller on 26 Feb 2019
Hello, I have created several pdf's using the formula below (60 pdf's). I would like to calculate the mean of all these pdfs, but I have no idea how to do this.
Here is the formula I used to create each of the pdfs. L1 is a structure array whose field Z contains a matrix of Z values. This example uses L1(1); there are 60 of these structures in total, from L1(1).Z to L1(60).Z, and I calculated a pdf for each of these populations of Z.
%pdf for the structure L1(1).Z
pd = fitdist(L1(1).Z(:),'Normal');
x_pdf = [min(L1(1).Z(:)):0.01:max(L1(1).Z(:))];
y = pdf(pd,x_pdf);
Thanks in advance for any help

John D'Errico on 25 Feb 2019
Edited: John D'Errico on 25 Feb 2019
This is not a question about MATLAB. But you have done something.
You do not have a PDF. You have an approximation to a PDF, sampled over a finite range, at a finite set of steps.
If you wish to compute the mean of a random variable with known distribution parameters, you would be best advised to use resources like Wikipedia, for example its page on the lognormal distribution.
That page states (look on the right side) the mean and variance of a lognormal distribution in terms of the usual distribution parameters.
As well, since you are using fitdist, you already have the stats toolbox. So you have access to tools like lognstat (or the corresponding tool for whatever distribution you are using). Use the available tools. Do NOT try to cobble up code to do what you do not really understand. Writing code to do what already exists for you to use is just a bad idea when you have no clue as to what you are doing. (What evidence do I have that you have no clue about these things? It is that you don't know how to compute the mean of a continuous random variable. At worst, something immediately found online.)
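A minimal sketch of that advice applied to the normal fits from the question, assuming the Statistics and Machine Learning Toolbox is available. The struct array L1 is filled with synthetic data as a stand-in for the real measurements; the 20-by-30 matrix size, the simulated values, and the names groupMeans and groupSigmas are assumptions for illustration only.

```matlab
% Synthetic stand-in for the real data: 60 groups, each with a matrix Z.
rng(0);                                        % reproducible random numbers
nGroups = 60;
L1 = struct('Z', cell(1, nGroups));
for k = 1:nGroups
    L1(k).Z = 100 + 5*k + 10*randn(20, 30);    % hypothetical values
end

groupMeans  = zeros(1, nGroups);
groupSigmas = zeros(1, nGroups);
for k = 1:nGroups
    pd = fitdist(L1(k).Z(:), 'Normal');        % same fit as in the question
    groupMeans(k)  = mean(pd);                 % mean of the fitted distribution (pd.mu)
    groupSigmas(k) = std(pd);                  % standard deviation (pd.sigma)
end
```

For a normal fit, mean(pd) coincides with the sample mean of the data in that group, so there is no need to integrate the sampled curve y to recover it.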
Can you compute the mean of a distribution where the PDF is approximated at a finite set of points? Well, yes. You might want to read about the mean of a continuous distribution.
In there, you will find that the mean of a random variable is given as
distributionmean = int(x*pdf(x),-inf,inf)
So you want to compute the integral of x times the pdf(x), integrating from -inf to inf. In the case of a distribution like the lognormal, the pdf only lives on [0,inf) so that would be the bounds of interest.
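As a rough illustration of that integral, here is a sketch using MATLAB's integral function on a normal PDF, since the fits in the question are normal. The parameters mu = 2 and sigma = 3 are arbitrary values chosen for the example, not taken from the question.

```matlab
% Mean of a normal distribution by numerically integrating x*pdf(x).
mu    = 2;                                     % assumed example parameter
sigma = 3;                                     % assumed example parameter
integrand = @(x) x .* normpdf(x, mu, sigma);   % x times the PDF
m = integral(integrand, -Inf, Inf);            % integrate over the whole real line
fprintf('numerical mean = %.6f, true mean = %.6f\n', m, mu);
```

The numerical result should match mu to within the integration tolerance, which is why the distribution parameters alone are enough when the distribution is known.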
Now, if I compute the actual mean and variance of a standard lognormal PDF, thus with distribution parameters of [0,1], I will find that the mean is exp(1/2).
exp(0 + 1^2/2)
ans =
1.64872127070013
[m,v] = lognstat(0,1)
m =
1.64872127070013
v =
4.67077427047161
As you see, lognstat agrees with my estimate of the mean.
Now, let's try it for a lognormal, approximated as you did.
x = 0:.01:10;
trapz(x,x.*lognpdf(x))
ans =
1.4898533607038
As you can see, trapz did not do very well here, off by roughly 10%. The cause is neither that I sampled the PDF too coarsely nor the integration error of trapz itself.
The problem is that this does not sample the lognormal PDF sufficiently far into the tails. The lognormal distribution has a heavy right tail.
logncdf(10)
ans =
0.9893489006583
Even trapz agrees with that measure.
trapz(x,lognpdf(x))
ans =
0.989348905931384
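A short sketch of how the truncation affects the result: the same trapz estimate is repeated with the upper limit pushed further into the tail. The upper limits 10, 100 and 1000 are arbitrary choices for illustration.

```matlab
% Effect of the truncation point on the trapz estimate of the lognormal mean.
exactMean = exp(1/2);                          % mean of the standard lognormal
for upper = [10 100 1000]
    x = 0:0.01:upper;                          % same step, longer range
    approxMean = trapz(x, x .* lognpdf(x));    % truncated integral of x*pdf(x)
    covered    = logncdf(upper);               % probability mass below the cutoff
    fprintf('upper = %4d: mean ~ %.6f (exact %.6f), mass covered = %.6f\n', ...
            upper, approxMean, exactMean, covered);
end
```

The estimate approaches exp(1/2) only once the grid reaches far enough into the right tail, even though the step size never changes.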
So, CAN you compute the means of those approximate PDFs? Well, yes, you can use trapz to do so, as I showed. Should you? Sigh.
John D'Errico on 25 Feb 2019
I'm sorry, but this still makes little sense. What is the mean of two PDFs together? What do you intend by that statement?
Taking points from many PDFs, then concatenating them together, and then using a normal fitdist on the result? Again, sorry, but that makes little mathematical or statistical sense.
You need to understand that the numbers generated by EVALUATING a PDF are not in themselves random variables. For example, if I did this:
x = linspace(-3,3,100);
p = normpdf(x);
you cannot simply add the vector p to another such construct, and have it mean something statistically, as I think you are trying to do.
This is a mistake I've seen others make. They confuse random variables, for example, the output from randn or rand, with the output from a PDF, such as normpdf.
As such, if I compute something like
mean(p)
ans =
0.1646
that is NOT the mean of the distribution. Nor does it make sense to form the sum of two such vectors. Finally, it makes absolutely no sense at all to then try to throw p into a tool like fitdist.
parms = fitdist(p.','normal')
parms =
NormalDistribution
Normal distribution
mu = 0.164598 [0.136784, 0.192411]
sigma = 0.140175 [0.123074, 0.162837]
In fact, the normal distribution that randn samples from has a true mean of zero and a variance of 1.
mean(randn(1,100000))
ans =
-0.0015908
var(randn(1,100000))
ans =
1.0028
Remember that these are only sample statistics, so they will approach the true mean and variance only as the sample size gets large.
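To put the three quantities side by side, here is a small sketch. The variable names meanOfPdfValues, distMean and sampleMean are made up for this example.

```matlab
% Three different numbers that are easy to confuse.
x = linspace(-3, 3, 100);
p = normpdf(x);                       % heights of the PDF curve, not random samples

meanOfPdfValues = mean(p);            % average curve height; not a distribution mean
distMean = trapz(x, x .* p);          % approximates the integral of x*pdf(x); near 0

rng(1);                               % reproducible random numbers
sampleMean = mean(randn(1, 1e5));     % sample statistic of actual random draws; near 0

fprintf('mean(p) = %.4f, trapz estimate = %.2e, sample mean = %.4f\n', ...
        meanOfPdfValues, distMean, sampleMean);
```

Only the last two estimate the mean of the distribution; the first just describes how tall the curve is on average over the chosen grid.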
Anyway, I think you are confused as to what a PDF means. I think you need to do a serious amount of reading about these things, as it looks like you are just trying to do virtually random things, and thinking they work. I would suggest a good starting course in probability and statistics, or at least a good basic text. Any such text would probably be fine.

Jules Ray on 25 Feb 2019
I think we are not understanding each other, and instead of suggesting going to the university again I would suggest reading the question more carefully.
I have this data L1, which is empirical. Each member of L1 comprises several measurements, and there are 60 such groups; I obtained a PDF for each of them. So I wanted to obtain the mean across these groups. Does this make more sense?
Best
Jeff Miller on 26 Feb 2019
> i know but in my case all are normal distributions
OK then, suppose one is a normal with mean 0 and sigma 1, and the other is a normal with mean 1 and sigma 1. What would you say is "the mean of these two pdfs"?
> I want to know the probability of certain elevation in the whole data
If this is what you want to know, I am not sure why you are messing around with normal distributions in the first place. Why not simply combine the data from all 60 sites into one large dataset and tabulate the frequency of each different elevation? Doesn't that give you exactly this probability?
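One way the pooling idea could look, as a sketch only. The struct array L1 is again filled with synthetic values, and the 10-unit bin width and the 200 to 250 query range are assumptions picked for the example.

```matlab
% Synthetic stand-in for the 60 groups of measurements.
rng(0);
nGroups = 60;
L1 = struct('Z', cell(1, nGroups));
for k = 1:nGroups
    L1(k).Z = 100 + 5*k + 10*randn(20, 30);    % hypothetical elevations
end

% Pool every measurement from every group into one column vector.
cols = arrayfun(@(s) s.Z(:), L1, 'UniformOutput', false);
allZ = vertcat(cols{:});

% Empirical probability of each elevation bin (assumed 10-unit bins).
edges = floor(min(allZ)):10:(ceil(max(allZ)) + 10);
prob  = histcounts(allZ, edges, 'Normalization', 'probability');

% Empirical probability that an elevation falls in an assumed range.
pRange = mean(allZ >= 200 & allZ < 250);
fprintf('P(200 <= Z < 250) ~ %.4f\n', pRange);
```

This works directly on the data and makes no assumption that the pooled elevations follow a normal distribution.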