How to check and remove outliers when it is Non-normal distribution

Question

J1 on 18 Nov 2015

https://www.mathworks.com/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution

Commented: Star Strider on 19 Nov 2015

I have found that many people say z-score and mapstd standardization are good ways to detect outliers. But the z-score is only useful when the data are normally distributed, and I found that my data do not follow a normal distribution. What should I do? (1) Should I transform my data (Box-Cox or Johnson transformation) to a normal distribution and then use the z-score to detect outliers? (2) After transforming the data and removing the outliers, should I use the transformed data or the original data (with the outliers removed in both) as the input to the neural network? I found that if I input the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?

Can anybody help? Thanks a lot.


Answer 1

Star Strider on 18 Nov 2015

https://www.mathworks.com/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution#answer_200342

The z-score is frequently used because, according to the Central Limit Theorem, when the data are sufficiently numerous, their sample means tend to be normally distributed regardless of the underlying distribution. (There is more to it than this simple statement, but that is the most basic explanation.)

If you know how your data are distributed, you can get the ‘critical values’ of the 0.025 and 0.975 probabilities for it and use them as your decision criteria to reject outliers. Again, outlier detection and rejection is another topic that goes beyond this simple explanation, and I encourage you to explore it on your own. How you decide to implement it with your data is something you will have to experiment with.
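As a rough illustration of the critical-value idea, the sketch below fits an assumed distribution and rejects points outside its 0.025 and 0.975 quantiles. It assumes the Statistics Toolbox is available and that a lognormal model suits the data; the synthetic sample, the planted extreme value, and the variable names are all hypothetical.

```matlab
% Sketch: reject points outside the 2.5% and 97.5% critical values.
% Assumes the Statistics Toolbox and that a lognormal model fits the data.
rng(0);                                   % reproducible synthetic example
data = lognrnd(0, 0.5, 200, 1);           % hypothetical right-skewed sample
data(end) = 50;                           % planted extreme value

pd   = fitdist(data, 'Lognormal');        % fit the assumed distribution
crit = icdf(pd, [0.025 0.975]);           % lower and upper critical values

isOutlier = data < crit(1) | data > crit(2);   % points beyond either limit
cleaned   = data(~isOutlier);                  % data with those points dropped

fprintf('Limits: [%.3f, %.3f]; flagged %d of %d points\n', ...
        crit(1), crit(2), nnz(isOutlier), numel(data));
```

Note that these limits flag roughly 5% of the points by construction, even when nothing is wrong with the data, so it is worth inspecting the flagged points before deleting them.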

Greg Heath on 19 Nov 2015

As I have mentioned in my answer:

Using zscore is so useful for detecting outliers in nonnormal distributions, I use it most of the time.

Again:

For outlier detection I recommend using the combination of zscore and plots with all non-binary data.

Greg

Star Strider on 19 Nov 2015

My pleasure.

A data set n>30 will approximate a normal distribution if it is otherwise t-distributed, but you would have to look at your data to see if they approximate a normal distribution. If you have any doubts as to its distribution, I would use one of the histogram functions, and if you have the Statistics Toolbox, the histfit function.

The most reliable way to determine if your data are normally distributed is to use the Statistics Toolbox Kolmogorov-Smirnov test, implemented in the kstest function. Another related test for the normal and other distributions is the Lilliefors test, implemented in the lillietest function.
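A minimal sketch of both tests follows, assuming the Statistics Toolbox. With a single input, kstest compares the sample against a standard normal distribution, so the data are standardized first here; lillietest estimates the mean and standard deviation itself. The exponential sample is a hypothetical stand-in for real data.

```matlab
% Sketch: formal normality checks (requires the Statistics Toolbox).
rng(1);                                   % reproducible synthetic example
data = exprnd(2, 89, 1);                  % hypothetical non-normal sample, N = 89

z = (data - mean(data)) / std(data);      % standardize: kstest(z) tests against N(0,1)
[hKS, pKS] = kstest(z);                   % h = 1 rejects normality at the 5% level
[hL,  pL ] = lillietest(data);            % normality test with estimated parameters

fprintf('kstest:     h = %d, p = %.4f\n', hKS, pKS);
fprintf('lillietest: h = %d, p = %.4f\n', hL,  pL);
```

Because the mean and standard deviation used for standardizing are estimated from the same data, the kstest result tends to be conservative; lillietest is designed for exactly that situation.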

If you don’t have the Statistics Toolbox, one simple test is to see if the median approximates the mean. It should for normally-distributed data, but generally will not for skewed distributions. (I leave the interpretation of ‘approximates’ to you, in the context of your data. They should be virtually the same for normally-distributed data.) Bear in mind that the two also agree for any symmetric distribution, so this check can only rule normality out, not confirm it. You can also use the randn function with the mean and std of your data to simulate a normal sample, then use a histogram function to compare it with your data. The randn call would be (with ‘data’ being your data):

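data_mean = mean(data);                                 % sample mean of your data
data_std  = std(data);                                  % sample standard deviation
data_sim  = data_mean + data_std*randn(size(data));     % normal sample with the same mean, std and size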
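One way to carry out the comparison is sketched below using only base MATLAB. The skewed sample is hypothetical, the `gap` variable is an illustrative name, and the `histogram` function assumes R2014b or later (older releases would use `hist`).

```matlab
% Sketch: compare the data with a simulated normal sample (base MATLAB only).
rng(2);                                   % reproducible synthetic example
data = 10 + 3*randn(500, 1).^2;           % hypothetical right-skewed sample

data_mean = mean(data);
data_std  = std(data);
data_sim  = data_mean + data_std*randn(size(data));

% Distance between mean and median, expressed in standard deviations
gap = abs(data_mean - median(data)) / data_std;
fprintf('mean = %.3f, median = %.3f, gap = %.3f std\n', ...
        data_mean, median(data), gap);

% Histograms of the observed and simulated samples, one above the other
figure;
subplot(2, 1, 1); histogram(data, 30);
title('Observed data');
subplot(2, 1, 2); histogram(data_sim, 30);
title('Simulated normal sample with the same mean and std');
```

If the two histograms have clearly different shapes, or the gap is far from zero, the normal assumption is doubtful.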

If your data turn out to be normally-distributed, you can certainly use the z-score reliably to scale them or test them with respect to detecting outliers. In the limit (which is to say a huge number of observations), the CLT would certainly apply. However N=89 is not huge, so you will have to analyse your data and see how they are distributed.

Answer 2

Greg Heath on 18 Nov 2015

https://www.mathworks.com/matlabcentral/answers/255870-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution#answer_200385

Regardless of the distribution, I find that a combination of zscore with plots of original and transformed data is sufficient for me to detect outliers. Whether points are deleted or replaced by a reduced value depends on how I interpret the plots.
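A minimal sketch of combining z-scores with plots is shown below. The threshold of 3 is an assumption rather than something stated in this thread, and the data with its planted outlier are synthetic. The z-score is computed by hand so the Statistics Toolbox is not required; with the toolbox, `zscore(x)` gives the same values.

```matlab
% Sketch: flag candidate outliers with z-scores and inspect them in plots.
rng(3);                                   % reproducible synthetic example
x = randn(100, 1);                        % hypothetical measurements
x(40) = 8;                                % planted outlier

z   = (x - mean(x)) / std(x);             % same as zscore(x) in the Statistics Toolbox
idx = abs(z) > 3;                         % assumed threshold of 3 standard deviations

figure;
subplot(2, 1, 1);
plot(x, '.-'); hold on;
plot(find(idx), x(idx), 'ro');            % circle the flagged points
title('Original data'); hold off;
subplot(2, 1, 2);
plot(z, '.-'); hold on;
plot(find(idx), z(idx), 'ro');
title('z-scores'); hold off;

fprintf('Flagged indices: %s\n', mat2str(find(idx)'));
```

The plots are what decide whether a flagged point is deleted, replaced, or kept; the threshold only produces candidates.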

If you have doubts you can always make multiple models based on original and modified data.

Hope this helps.

Thank you for formally accepting my answer

Greg

J1 on 19 Nov 2015

If we find that there are outliers, should I look for more variables to predict my output? For example, I use weather data to predict the sales of a product. If I find that an outlier is due to a promotion or some other cause, should I add these causes as new variables to the neural network for prediction?

Greg Heath on 19 Nov 2015

Outliers are usually isolated points that are the result of bad measurements or bad transcriptions. Therefore they should be removed. However, if you plot the data, very often you can guess the approximate true value of the measurement. Then you have the option of replacing the outlier with the approximation.
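For ordered data such as a time series, one possible way to approximate the true value is to interpolate from the neighbouring points. The sketch below is only an illustration under that assumption: linear interpolation with `interp1` stands in for the visual guess described above, and the series, the planted bad value, and the threshold of 3 are hypothetical.

```matlab
% Sketch: replace an isolated outlier with a value interpolated from its neighbours.
t = (1:20)';                              % hypothetical sample times
y = 2*t + 5;                              % hypothetical smooth series
y(8) = 100;                               % planted bad measurement (true value 21)

z   = (y - mean(y)) / std(y);             % z-scores of the series
bad = abs(z) > 3;                         % assumed threshold of 3 standard deviations

yFixed      = y;
yFixed(bad) = interp1(t(~bad), y(~bad), t(bad));   % linear interpolation from good points

fprintf('Replaced %d point(s); y(8) is now %.1f\n', nnz(bad), yFixed(8));
```

Interpolation only makes sense when neighbouring observations are related; for unordered observations, deleting the point is the simpler option.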
