finding extreme values in a panel data set

Asked by salva on 22 Jul 2012
Latest activity Commented on by salva on 22 Jul 2012

Dear Matlab experts

I have

A={

    [1]    ' '              [NaN]     [NaN]     [NaN]     
    [1]    '30/11/08'    [11.1447]    [0.2117]    [463.1360]
    [1]    '28/12/08'    [11.7129]    [0.2209]    [436.6316]
    [1]    '25/01/09'    [11.5152]    [0.2212]    [441.5430]
    [1]    '22/02/09'    [11.4854]    [0.2201]    [453.7015]
    [1]    '22/03/09'    [NaN]       [0.2185]    [461.3925]
    [1]    '19/04/09'    [10.9700]    [0.2104]    [486.4095]
    [1]    '17/05/09'    [11.4846]    [0.2162]    [451.4833]
    [1]    '14/06/09'    [10.9967]    [0.2158]    [475.8620]
    [1]    '12/07/09'    [11.1990]    [0.2211]    [449.4574]
    [2]    ' '             [NaN]     [NaN]     [NaN]     
    [2]    '30/11/08'    [10.8638]    [0.2175]    [472.2659]
    [2]    '28/12/08'    [11.0897]    [0.2182]    [479.3408]
    [2]    '25/01/09'    [11.0996]    [0.2286]    [442.6719]
    [2]    '22/02/09'    [11.0473]    [0.2211]    [481.4548]
    [2]    '22/03/09'    [11.1151]    [0.2259]    [468.2757]
    [2]    '19/04/09'    [10.9865]    [0.2300]    [461.5581]
    [2]    '17/05/09'    [10.7931]    [0.2259]    [487.6257]
    [2]    '14/06/09'    [10.5845]    [0.2200]    [529.8777]
    [2]    '12/07/09'    [10.8926]    [0.2315]    [433.3039]
    [3]    ' '             [NaN]     [NaN]     [NaN]     
    [3]    '30/11/08'    [10.7658]    [0.2339]    [487.7729]
    [3]    '28/12/08'    [10.5227]    [0.2305]    [505.4553]
    [3]    '25/01/09'    [10.6700]    [0.2279]    [480.0745]
    [3]    '22/02/09'    [10.7632]    [NaN]    [518.5076]
    [3]    '22/03/09'    [10.5065]    [0.2271]    [470.0111]
    [3]    '19/04/09'    [10.6112]    [0.2215]    [483.5394]
    [3]    '17/05/09'    [10.5855]    [0.2190]    [497.3172]
    [3]    '14/06/09'    [10.3085]    [0.2158]    [515.7492]
    [3]    '12/07/09'    [10.4475]    [0.2211]    [482.3380]
}

A is a panel data set that shows the evolution of some variables (columns 3-5) over time (column 2) for 3 individuals (column 1).

I would like

1) to detect outliers (their position and their corresponding values) in columns 3, 4 and 5 where an outlier is defined as a “a value that it is larger than (or equal to) 3 standard deviations over the mean”.

2) plot a diagram for each of the columns 3, 4 and 5 where outliers will stand out with, say, a red color

Could I do that with some matlab code?

Thank you very much

just to mention that my real A is huge where I have 400 individuals.

1 Comment

Image Analyst on 22 Jul 2012

That's far, far from huge. Do you want this done separately for each group listed in column 1? It looks complicated by the fact that not only do NaNs separate the groups listed in column 1, but there are random Nan's scattered about.

salva

Tags

Products

No products are associated with this question.

1 Answer

Answer by Image Analyst on 22 Jul 2012
Edited by Image Analyst on 22 Jul 2012
Accepted answer

What have you tried so far? Did you try anything along the lines of

column3 = cell2mat(A(:,3))
goodIndexes = ~isnan(column3)
goodColumn3 = column3(goodIndexes)
mean3 = mean(goodColumn3)
std3 = std(goodColumn3)
lowerLimit = mean3 - std3
upperLimit = mean3 + std3
outliers = column3 < lowerLimit | column3 > upperLimit
outlierIndexes = find(outliers);
normalIndexes = find(~outliers);
x = (1:length(column3))';
stem(x(normalIndexes), column3(normalIndexes), 'b');
hold on;
grid on;
stem(x(outlierIndexes), column3(outlierIndexes), 'r');
% Enlarge figure to full screen.
set(gcf, 'units','normalized','outerposition',[0 0 1 1]);

and so on for the other columns?

6 Comments

salva on 22 Jul 2012

I searched on the internet but could not find a specific rule based on the MAD

Image Analyst on 22 Jul 2012

You first calculate the MAD. Then you multiply that by a factor that says how many of those you need to consider something an outlier, just like you chose 3 for sigma to say that if it were more than 3 sigma away from the mean it's an outlier. Same thing except use MAD instead of the standard deviation sigma. Like, if that data point's value is more than 2 (or 3 or 4 or whatever suits your situation) MADs away from the mean, then that data point is an outlier.

salva on 22 Jul 2012

thanks imager

Image Analyst

Contact us