How to check and remove outliers when the data are not normally distributed

Many people say that z-score and mapstd standardization are good ways to detect outliers. But the z-score is only useful when the data are normally distributed, and I found that my data do not follow a normal distribution. What should I do? (1) Should I transform my data (Box-Cox or Johnson transformation) to a normal distribution and then use the z-score to detect outliers? (2) After transforming and removing the outliers, should I use the transformed data or the original data (with outliers removed in both) as the input to the neural network? I found that if I feed my transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?
Can anybody help? Thanks a lot.

Accepted Answer

Star Strider on 18 Nov 2015
The z-score is frequently used because, according to the Central Limit Theorem, the means (or sums) of sufficiently numerous observations tend to be normally distributed regardless of the underlying distribution. (There is more to it than this simple statement, but that is the most basic explanation.)
If you know how your data are distributed, you can get the ‘critical values’ at the 0.025 and 0.975 cumulative probabilities for that distribution and use them as your decision criteria to reject outliers. Again, outlier detection and rejection is another topic that goes beyond this simple explanation, and I encourage you to explore it on your own. How you decide to implement it with your data is something you will have to experiment with.
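As a rough sketch of the critical-value idea, suppose (purely for illustration) that the data are known to follow an exponential distribution, whose inverse CDF has the closed form x = -mu*log(1-p). The sample values and the variable names below are made up; for other distributions the Statistics Toolbox icdf function can supply the critical values instead.

```matlab
% Hypothetical sample, ASSUMED here to be exponentially distributed.
data = [0.3 1.2 0.8 2.5 0.1 0.9 1.7 0.4 14.0 0.6];

% Estimate the exponential mean from the sample.
mu_hat = mean(data);

% Critical values at the 0.025 and 0.975 cumulative probabilities,
% from the exponential inverse CDF: x = -mu*log(1 - p).
p = [0.025 0.975];
crit = -mu_hat * log(1 - p);

% Flag anything outside the two critical values as a candidate outlier.
is_outlier = data < crit(1) | data > crit(2);
data_clean = data(~is_outlier);
```

Note that an extreme point also inflates the estimated mean, so it can be worth re-estimating the parameters after a first pass and checking whether the flagged set changes.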
  3 Comments
Greg Heath on 19 Nov 2015
As I mentioned in my answer, zscore is so useful for detecting outliers in non-normal distributions that I use it most of the time.
Again: for outlier detection I recommend using the combination of zscore and plots with all non-binary data.
Greg
Star Strider on 19 Nov 2015
My pleasure.
A data set with n>30 will approximate a normal distribution if it is otherwise t-distributed, but you would have to look at your data to see if they approximate a normal distribution. If you have any doubts as to their distribution, I would use one of the histogram functions, and if you have the Statistics Toolbox, the histfit function.
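A more formal way to determine if your data are normally distributed is to use the Statistics Toolbox Kolmogorov-Smirnov test, implemented in the kstest function. By default kstest compares the data against the standard normal distribution, so standardize the data first. A related test for the normal and other distributions, which allows for the mean and standard deviation being estimated from the data, is the Lilliefors test, implemented in the lillietest function.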
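A minimal sketch of how the two tests might be called, assuming the Statistics Toolbox is installed. The sample here is invented and deliberately skewed; both functions return h = 1 when normality is rejected at the default 5% level.

```matlab
% Hypothetical, strongly right-skewed sample (values are made up).
x = exp(linspace(0, 5, 100));

% kstest compares against the standard normal by default,
% so standardize the data before calling it.
x_std = (x - mean(x)) / std(x);
[h_ks, p_ks] = kstest(x_std);

% lillietest tests for normality with the mean and standard deviation
% estimated from the data, so no standardization is needed.
[h_lillie, p_lillie] = lillietest(x);

fprintf('kstest:     h = %d, p = %.4g\n', h_ks, p_ks);
fprintf('lillietest: h = %d, p = %.4g\n', h_lillie, p_lillie);
```

For data this skewed, lillietest may warn that the p-value lies outside its tabulated range; the decision h is still reported.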
If you don’t have the Statistics Toolbox, one simple test is to see if the median approximates the mean. It should for normally-distributed data, but usually will not for skewed distributions (a symmetric non-normal distribution would also pass this check, so it is only a first screen). (I leave the interpretation of ‘approximates’ to you, in the context of your data. They should be virtually the same for normally-distributed data.) You can also use the randn function with the mean and std of your data, then use a histogram function to compare them. The randn call would be (with ‘data’ being your data):
% Simulate normal data with the same mean and standard deviation as 'data'
data_mean = mean(data);
data_std = std(data);
data_sim = data_mean + data_std*randn(size(data));
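Putting the mean/median check and the randn comparison together might look like the sketch below. The sample is invented (a skewed series of 89 points), and the names rel_gap and edges are only illustrative; histogram requires R2014b or later.

```matlab
% Hypothetical right-skewed sample with N = 89 points (values are made up).
data = exp(linspace(-2, 2, 89));

% Quick check: for normal data the mean and median nearly coincide.
data_mean = mean(data);
data_median = median(data);
data_std = std(data);
rel_gap = abs(data_mean - data_median) / data_std;  % gap in units of std

% Simulated normal data with the same mean and standard deviation.
rng(0);  % fixed seed so the comparison is repeatable
data_sim = data_mean + data_std*randn(size(data));

% Overlay both histograms on common bin edges for a visual comparison.
edges = linspace(min([data data_sim]), max([data data_sim]), 16);
figure
histogram(data, edges)
hold on
histogram(data_sim, edges)
hold off
legend('data', 'simulated normal')

fprintf('mean = %.3f, median = %.3f, gap = %.2f std\n', ...
    data_mean, data_median, rel_gap);
```

If the two histograms have clearly different shapes, or the gap between mean and median is a sizeable fraction of the standard deviation, the normality assumption is doubtful.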
If your data turn out to be normally-distributed, you can certainly use the z-score reliably to scale them or test them with respect to detecting outliers. In the limit (which is to say a huge number of observations), the CLT would certainly apply. However N=89 is not huge, so you will have to analyse your data and see how they are distributed.


More Answers (1)

Greg Heath on 18 Nov 2015
Regardless of the distribution, I find that a combination of zscore with plots of original and transformed data is sufficient for me to detect outliers. Whether points are deleted or replaced by a reduced value depends on how I interpret the plots.
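One way the combination described above could be sketched is shown here. The data, the log transformation, and the 2.5 cutoff are all assumptions chosen for illustration; the z-scores are computed by hand, which gives the same result as the Statistics Toolbox zscore function for a vector.

```matlab
% Hypothetical data with one suspicious value (numbers are made up).
data = [12 15 14 13 16 15 14 90 13 15];

% z-scores of the original data (equivalent to zscore(data)).
z = (data - mean(data)) / std(data);

% z-scores after an illustrative log transformation.
data_log = log(data);
z_log = (data_log - mean(data_log)) / std(data_log);

% Hypothetical cutoff. With only 10 points, |z| cannot exceed
% (n-1)/sqrt(n), about 2.85, so a cutoff of 3 would never trigger.
cutoff = 2.5;
suspect = abs(z) > cutoff;
suspect_log = abs(z_log) > cutoff;

% Plot original and transformed data with the flagged points marked.
idx = 1:numel(data);
figure
subplot(2,1,1)
plot(idx, data, 'o-', idx(suspect), data(suspect), 'rx')
title('Original data')
subplot(2,1,2)
plot(idx, data_log, 'o-', idx(suspect_log), data_log(suspect_log), 'rx')
title('Log-transformed data')
```

The plots are what guide the final decision; the cutoff only points to where to look.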
If you have doubts you can always make multiple models based on original and modified data.
Hope this helps.
Thank you for formally accepting my answer
Greg
  2 Comments
J1 on 19 Nov 2015
If we find that there are outliers, should I look for more variables to predict my output? For example, I use weather data to predict product sales. If I find that an outlier is due to a promotion or some other cause, should I add these causes as new variables to the neural network for prediction?
Greg Heath on 19 Nov 2015
Outliers are usually isolated points that are the result of bad measurements or bad transcriptions. Therefore they should be removed. However, if you plot the data, very often you can guess the approximate true value of the measurement. Then you have the option of replacing the outlier with the approximation.
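As a sketch of the replacement option, the example below approximates a flagged point by linear interpolation between its neighbours. Interpolation is only one possible stand-in for the value you would read off the plot, and the series and the flagged index are made up.

```matlab
% Hypothetical series with one bad measurement at position 4.
y = [10 11 12 50 14 15 16];

% Mark the point judged (from the plot) to be an outlier.
is_outlier = false(size(y));
is_outlier(4) = true;

% Replace it with a value interpolated from the remaining points.
idx = 1:numel(y);
y_fixed = y;
y_fixed(is_outlier) = interp1(idx(~is_outlier), y(~is_outlier), ...
    idx(is_outlier), 'linear');

% Alternatively, simply delete the point.
y_removed = y(~is_outlier);
```

This only makes sense when neighbouring points are related, for example in a time series; for unordered observations deletion is the simpler choice.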

