Even though it is not directly MATLAB-related, I figured I would pose this question to the MATLAB community because there are a bunch of smart and helpful people here :D
I have looked and looked but I cannot find a straightforward test or method to characterize a distribution that fails a normality test. I have read several peer-reviewed scientific journal articles where this does not stop authors from giving a mean and standard deviation (!) but I think that is a bad thing to do.
My current approach is to get a kernel smoothing density estimate of the distribution using a function I wrote around the built-in ksdensity() function, and play with the smoothing window width until it gives something that nicely portrays the data (not too spiky, not too round). I then give the peak value of the kernel estimate as my "mean" (i.e. the one number people will look at and prematurely judge everything by). The only way I know to then characterize the distribution width or deviation would be to give a full width at half maximum. Of course this is not good, because the distribution tends not to be symmetric around the peak, and the width is often on the order of the peak value in magnitude.
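For anyone who wants to reproduce the approach described above, here is a minimal sketch. The sample `x` is made up (gamma-distributed stand-in data), the 400-point grid is an arbitrary choice, and the FWHM step assumes the density estimate has a single peak.

```matlab
% Hypothetical skewed sample standing in for the real measurements
rng(4);                                  % fixed seed so the sketch is repeatable
x = gamrnd(2, 3, 500, 1);                % assumed data, not from the original post

% Kernel density estimate on a fine grid; bw is the bandwidth ksdensity picked
pts = linspace(min(x), max(x), 400);
[f, xi, bw] = ksdensity(x, pts);

% Peak of the estimate (this is the mode of the smoothed density)
[fmax, imax] = max(f);
peakLoc = xi(imax);

% Full width at half maximum, assuming a single-peaked estimate
above     = xi(f >= fmax/2);
fwhm      = above(end) - above(1);
leftHalf  = peakLoc - above(1);          % half-width below the peak
rightHalf = above(end) - peakLoc;        % half-width above the peak

fprintf('peak at %.3g, FWHM %.3g (left %.3g, right %.3g), bandwidth %.3g\n', ...
        peakLoc, fwhm, leftHalf, rightHalf, bw);
```

Reporting the left and right half-widths separately at least makes the asymmetry visible, although it does not turn the FWHM into a proper error bar.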
So people I am working with want to see some kind of error bars, and I have no idea what to give them to make them happy.
This is a recurring theme in my current work and I am desperate to find a good solution, so any pointers would be greatly appreciated. I am sure I am not the only one who has to deal with non-gaussian distributions.
If you want to see an example of one of these distributions, there are a couple in Figure 3 in the paper you can find here:
Thanks in advance, Rory
You should NOT use the peak of your distribution to estimate the mean, because it is not the mean. It is the mode.
For estimating the errors in these statistics, you could use the bootstrap or the jackknife (see Resampling Statistics).
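As a rough illustration of the bootstrap route, the sketch below uses bootci from the Statistics Toolbox to put a confidence interval on the median. The data vector `x` is invented for the example, and 2000 resamples is an arbitrary choice; swap in your own data and statistic.

```matlab
% Hypothetical skewed sample standing in for the real measurements
rng(1);                                  % fixed seed so the sketch is repeatable
x = gamrnd(2, 3, 500, 1);                % assumed data, not from the original post

nboot = 2000;                            % number of bootstrap resamples (arbitrary)
med   = median(x);                       % point estimate
ciMed = bootci(nboot, @median, x);       % 2-by-1: [lower; upper], 95% by default

fprintf('median = %.3g, 95%% bootstrap CI [%.3g, %.3g]\n', med, ciMed(1), ciMed(2));
```

The interval does not have to be symmetric around the point estimate, which is the main attraction for skewed data.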
You could also explore MATLAB's collection of distributions to see if any look like your data (see Distribution Reference). For example, some of the curves look like the Gamma distribution. However, each distribution is a model of a particular kind of statistical process, so ideally you should understand what a distribution represents before using it.
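One crude way to screen candidate distributions is to fit a few of them and compare negative log-likelihoods. The sketch below does that for three two-parameter families; the sample `x` and the choice of candidates are assumptions made for the example, and a lower value only says the family fits these data better, not that it is the right model of the underlying process.

```matlab
% Hypothetical positive, right-skewed sample
rng(2);                                  % fixed seed so the sketch is repeatable
x = gamrnd(2, 3, 500, 1);                % assumed data, not from the original post

% Maximum likelihood fits (all three families have two parameters)
[pGam, ciGam] = gamfit(x);               % pGam = [shape scale], ciGam rows = 95% bounds
pWbl  = wblfit(x);                       % [scale shape]
pLogn = lognfit(x);                      % [mu sigma] of log(x)

% Negative log-likelihood of each fit; smaller means a better fit to x
nll   = [gamlike(pGam, x), wbllike(pWbl, x), lognlike(pLogn, x)];
names = {'Gamma', 'Weibull', 'Lognormal'};
[~, best] = min(nll);

fprintf('Lowest negative log-likelihood: %s\n', names{best});
```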
Some other things I might consider:
1. Look at distributions of the log(data).
2. Consider using the median and quartiles (it may be more intuitive to use the interquartile range) or other quantiles. It may be possible to find theoretical ways to compute confidence intervals for those quantities, but the bootstrap approach may be adequate. Also Google for "five number summary."
3. There are larger families of distributions that include the normal as a special case. Look into the Johnson and Pearson families. There are Statistics Toolbox functions johnsrnd and pearsrnd for generating random samples from these distributions, but the "fitting" step is simply computing quantiles or moments.
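A short sketch of the three suggestions above, using made-up lognormal data in place of real measurements. The variable names are only illustrative, and the pearsrnd call simply matches the first four sample moments.

```matlab
% Hypothetical positive, right-skewed sample
rng(3);                                  % fixed seed so the sketch is repeatable
x = lognrnd(1, 0.6, 500, 1);             % assumed data, not from the original post

% 1. Compare skewness before and after taking logs
skewRaw = skewness(x);
skewLog = skewness(log(x));

% 2. Median, quartiles and a five-number summary
q       = quantile(x, [0.25 0.5 0.75]);
fiveNum = [min(x), q, max(x)];
spread  = q(3) - q(1);                   % interquartile range (iqr(x) is the shortcut)

% 3. Pearson system: the "fit" is just the first four sample moments
[r, type] = pearsrnd(mean(x), std(x), skewness(x), kurtosis(x), 1000, 1);

fprintf('skewness raw %.2f, log %.2f; median %.3g, IQR %.3g; Pearson type %d\n', ...
        skewRaw, skewLog, q(2), spread, type);
```

If the skewness of log(x) is close to zero, reporting the statistics on the log scale may be the simplest way out.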
I think a good distribution would be the Weibull, and it is available in the Statistics Toolbox. You could then use the distribution's parameters to compare datasets rather than the mean and standard deviation.
You can get confidence intervals for the parameters. Would that suffice for error bars?
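For what it's worth, here is a minimal sketch of that idea using wblfit. The two datasets `a` and `b` are simulated purely for illustration, and checking whether the intervals overlap is only a rough comparison, not a formal test.

```matlab
% Two hypothetical datasets to compare
rng(5);                                  % fixed seed so the sketch is repeatable
a = wblrnd(10, 1.5, 300, 1);             % assumed dataset A (scale 10, shape 1.5)
b = wblrnd(14, 1.5, 300, 1);             % assumed dataset B (scale 14, shape 1.5)

% Fitted parameters are [scale shape]; CI rows are lower and upper 95% bounds
[pA, ciA] = wblfit(a);
[pB, ciB] = wblfit(b);

% Do the 95% intervals on the scale parameter overlap?
scaleOverlap = ciA(2,1) >= ciB(1,1) && ciB(2,1) >= ciA(1,1);

fprintf('A: scale %.3g [%.3g, %.3g], shape %.3g [%.3g, %.3g]\n', ...
        pA(1), ciA(1,1), ciA(2,1), pA(2), ciA(1,2), ciA(2,2));
fprintf('B: scale %.3g [%.3g, %.3g], shape %.3g [%.3g, %.3g]\n', ...
        pB(1), ciB(1,1), ciB(2,1), pB(2), ciB(1,2), ciB(2,2));
fprintf('scale intervals overlap: %d\n', scaleOverlap);
```

The bounds in ciA and ciB could be drawn directly as asymmetric error bars on the fitted scale and shape parameters.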