How do I average a set of data?

Amin Gan on 10 Nov 2015
Commented: Amin Gan on 13 Nov 2015
I have two sets of data, A and B (1000 numbers each). Each element of A has a corresponding value in B, i.e. A(i,1) is paired with B(i,1). For example:
A=[ 1 ; 2 ; 1 ; 5 ; 10 ; 5 ]
B=[0.1 ; 0.5 ; 0.2 ; 0.3; 0.8 ; 0.9]
For A=1, B=0.1 and 0.2, so when A=1, B=0.3 (the sum of the values).
For A=2, B=0.5, so when A=2, B=0.5.
Some of the A values are repeated. I want to sum the B values belonging to each repeated value of A and then plot the result against A. The following code has been used to calculate the summation:
[uA a b] = unique(A);
sB = arrayfun(@(x) (sum(B(b==x))), 1:numel(a));
X = [uA sB'];
This method works perfectly, but when I plot uA vs. sB it gives me a very bad curve. The problem is that, with 1000 closely spaced data points, sB does not give the right answer, because the underlying A data has a downward/upward/downward trend (it comes from a set of data). For example, at x=45 (attached file) I should have 4 different values of y, which would sum to around y=11, but my data has only two values at exactly 45, and the next closest numbers are 44 and 46. Therefore, with the above method, at uA=45 I get sB=5.6, which is incorrect (it should be about 11).
Please find the attached file.
In summary, as shown in the attached file, from x=0 to x=250, y has 2, 3 or 4 values. For example, at exactly x=20 there might be 1, 2, 3 or 4 values.
I was wondering how to solve this problem. Is using the average or median a good idea?
  1 Comment
dpb on 10 Nov 2015
Look at
doc ismembertol % needs recent release
to build a grouping variable for the accumulator based on the tolerance you think appropriate.
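A minimal sketch of that idea, using the companion function uniquetol (same release family as ismembertol) rather than ismembertol itself, since its third output is already a grouping index. The toy data and the absolute tolerance of 0.05 are assumptions for illustration only; pick a tolerance that suits the real data.

```matlab
% Toy data (assumed): nearly equal A values should share one group
A = [1; 2; 1.02; 5; 10; 5.01];
B = [0.1; 0.5; 0.2; 0.3; 0.8; 0.9];

tol = 0.05;  % assumed absolute tolerance; 'DataScale', 1 stops uniquetol from rescaling it
[uA, ~, grp] = uniquetol(A, tol, 'DataScale', 1);  % grp(i) is the group number of A(i)

sB = accumarray(grp, B);  % sum of B within each tolerance group
```

Here 1 and 1.02 fall into one group and 5 and 5.01 into another, so sB has four entries. Note that tolerance grouping is not transitive: a long chain of closely spaced values may be split in ways that depend on the tolerance chosen.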


Accepted Answer

arich82 on 13 Nov 2015
Edited: arich82 on 13 Nov 2015
[Edit to fix typo in image.]
This answer references your previous question, here.
I was able to extract the data from the curve.fig plot by renaming it .mat,
data = load('curve.mat');
x = data.hgS_070000.children(1).children.properties.XData;
y = data.hgS_070000.children(1).children.properties.YData;
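The struct field names above depend on how the figure was saved, so they may differ for other files. As a hedged alternative, the line data can usually be read back through the graphics objects instead. The sketch below builds a small throwaway figure file (the name curve_demo.fig and its data are made up for the demonstration) and then reads it back; with a real file, only the second half is needed.

```matlab
% Create a small demo .fig file (hypothetical stand-in for curve.fig)
figFile = fullfile(tempdir, 'curve_demo.fig');
f = figure('Visible', 'off');
plot([0 1 2], [3 4 5]);
savefig(f, figFile);
close(f);

% Read the plotted data back without showing the figure
fig = openfig(figFile, 'invisible');
hLine = findobj(fig, 'Type', 'line');  % all line objects in the figure
x = get(hLine(1), 'XData');
y = get(hLine(1), 'YData');
close(fig);
delete(figFile);
```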
Plotting the data shows that, over the first 10% of the x-data, there can be up to four duplicate y-values. However, because of the discreteness of the data and the fact that x isn't uniformly sampled, some of the "duplicate" values appear to be missing.
Just for grins, I plotted the x and y data vs. their index, as well as diff(x) and diff(y): it appears that y comes from an analytical Gaussian curve, centered at 500, while x exhibits some odd discretization artifacts.
figure;
subplot(3, 1, 1);
hp1 = plot(x, y);
title('original data');
xlabel('x');
ylabel('y');
xlim([0, 2325]);
ylim([0, 12]);
subplot(3, 1, 2);
hp2 = plotyy(1:numel(x), x, 1:numel(y), y);
title('x and y vs. index')
xlabel('ind')
ylabel(hp2(1), 'x data');
ylabel(hp2(2), 'y data');
xlim(hp2(1), [0, 1000]);
xlim(hp2(2), [0, 1000]);
subplot(3, 1, 3);
hp3 = plotyy(2:numel(x), diff(x), 2:numel(y), diff(y));
title('diff(x) and diff(y) vs. index')
xlabel('ind')
ylabel(hp3(1), 'diff(x)');
ylabel(hp3(2), 'diff(y)');
xlim(hp3(1), [0, 1000]);
xlim(hp3(2), [0, 1000]);
In order to apply a summation or averaging filter to the y data, it's useful to fill in the gaps in x so that it is uniformly spaced, as addressed in your previous question. The difficulty lies in that x is non-monotonic, as evidenced by the 0 and negative values in the diff(x) plot. However, if we can get x to be uniformly spaced integers (with potentially repeated values), then we can convert x and y to parametric equations based on their common index, and interpolate that way, i.e. convert the ideal (non-Matlab form) y(x) to x(k) & y(k), and interpolate based on k, as in the previous question:
diffx = diff(x);
val = sign(diffx);
len = abs(diffx) + 1; % add one to include diff==0 as new phase; need to subtract off cumsum in ind
ind = [0, cumsum(len) - cumsum(abs(val))] + 1; % add one for 1-based indexing; note: x == X(ind);
n = ind(end); % note: numel(X) == n == sum(abs(diffx)) + nnz(diffx == 0) + 1;
mask = false(1, n-1);
mask(ind(1:end-1)) = true; % ind(end) == numel(X), not start of new phase
diffX = val(cumsum(mask)); % cumsum(mask) gives the rle phase number, i.e. index into val
X = x(1) + cumsum([0, diffX]);
K = 1:numel(X);
Y = interp1(K(ind), y, K);
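To make the index bookkeeping easier to follow, here is the same sequence of steps run on a four-point toy input. The toy x and y values are made up for illustration, and x is assumed to hold integers, as in the real data.

```matlab
% Toy input (assumed): x jumps by +2, repeats, then steps back by 1
x = [1 3 3 2];
y = [10 30 35 20];

diffx = diff(x);                                % [2 0 -1]
val = sign(diffx);                              % direction of each segment
len = abs(diffx) + 1;
ind = [0, cumsum(len) - cumsum(abs(val))] + 1;  % positions of the original samples in X
n = ind(end);
mask = false(1, n-1);
mask(ind(1:end-1)) = true;
diffX = val(cumsum(mask));                      % unit steps: +1, 0 or -1
X = x(1) + cumsum([0, diffX]);                  % [1 2 3 3 2]
K = 1:numel(X);
Y = interp1(K(ind), y, K);                      % [10 20 30 35 20]
```

The gap between x=1 and x=3 is filled with an interpolated point at X=2, the repeated 3 is kept as a zero step, and X(ind) reproduces the original x.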
We're now free to play with a few different smoothing techniques. As an example, I'll show the results of summing, averaging, and then smoothing the averaged data using a moving-average filter:
subs = (X - min(X)) + 1;
Ysum = accumarray(subs(:), Y).';
Ymean = accumarray(subs(:), Y, [], @mean).';
windowSize = 10;
Yfilt = filter(ones(1, windowSize), windowSize, Ymean);
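As a small illustration of what these three lines compute, the sketch below applies them to a five-point toy input. The toy X and Y and the window size of 2 are assumptions chosen to keep the numbers easy to check by hand.

```matlab
% Toy input (assumed): X=2 and X=3 each occur twice
X = [1 2 3 3 2];
Y = [10 20 30 35 20];

subs = (X - min(X)) + 1;                        % map X values to 1-based bins
Ysum = accumarray(subs(:), Y(:)).';             % [10 40 65]
Ymean = accumarray(subs(:), Y(:), [], @mean).'; % [10 20 32.5]

windowSize = 2;
Yfilt = filter(ones(1, windowSize), windowSize, Ymean);  % [5 15 26.25]
```

The first filtered value is pulled towards zero because filter assumes zero initial conditions, which is one reason a causal moving average looks poor near the start of the data.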
Plotting the results,
figure;
subplot(3, 1, 1);
hp1 = plot(x, y, '-', X, Y, '--');
title('original and interpolated data');
xlabel('x');
ylabel('y');
legend({'orig x-y', 'interp X-Y'});
xlim([0, 2325]);
ylim([0, 12]);
subplot(3, 1, 2);
hp2 = plot(x, y);
hold all;
plot(X, Y, '--');
plot(min(X):max(X), Ysum, '--');
plot(min(X):max(X), Ymean, '--');
plot(min(X):max(X), Yfilt, '--');
title('various Y schemes')
xlabel('X');
ylabel('{y, Y, Ysum, Ymean, Yfilt}');
xlim([0, 2325]);
ylim([0, 12]);
legend({'orig', 'interp', 'sum', 'mean', 'moving avg. (filt)'});
subplot(3, 1, 3);
hp3 = plotyy(2:numel(X), diff(X), 2:numel(Y), diff(Y));
% hp3 = plotyy(2:numel(x), diff(x), 2:numel(y), diff(y));
title('diff(X) and diff(Y) vs. index')
xlabel('ind')
ylabel(hp3(1), 'diff(X)');
ylim(hp3(1), [-10, 10]);
ylabel(hp3(2), 'diff(Y) ');
In the first subplot, we see that the interpolated data (dashed line) exactly matches the original data. In the second, we see that none of the smoothing schemes are terribly pretty, though which is "best" will depend on your application. Note that all have a jump where the data becomes monotonic. (You could use a larger running average filter to further smooth the mean data; using filtfilt might be wise to prevent a phase lag. However, since y seems to be analytical, I wonder if there is an analytical method you should be considering...) In the third subplot, we see that the step size in x is now exactly +1, 0, or -1, as desired in the previous question; because of the interpolation, however, the step size in y is no longer the analytical derivative of the Gaussian, but some discretely filtered variation of it.
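If the Signal Processing Toolbox (which provides filtfilt) is not available, a centered moving average built with conv is one possible way to avoid the lag. This is a swapped-in technique rather than filtfilt itself, and the toy ramp and window size of 3 below are assumptions for illustration.

```matlab
% Toy input (assumed): a simple ramp makes the lag easy to see
Ymean = [1 2 3 4 5];
windowSize = 3;                      % odd, so the window can be centered
kernel = ones(1, windowSize) / windowSize;

Ycausal = filter(kernel, 1, Ymean);      % [1/3 1 2 3 4], lags the input by one sample
Ycentered = conv(Ymean, kernel, 'same'); % [1 2 3 4 3], no lag in the interior
```

The interior of Ycentered matches the ramp exactly, while Ycausal is shifted by one sample. Both are distorted at the ends, where the window runs off the data and is implicitly padded with zeros.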
I hope this helps you better analyze your problem. In general, once you've applied the interpolation scheme to get X and Y, you should find it trivial to apply any desired smoothing algorithm, though again, thinking about an analytical solution might be wise.
Please accept this answer if it helps, or let me know in the comments if I've missed something. (Note: I might not get back to you for a couple of days.)
  1 Comment
Amin Gan on 13 Nov 2015
Thank you so much for your complete and clear explanation. You are right: the y axis comes from a Gaussian distribution, and the first 10% of the x data is non-monotonic, which makes it difficult to get a smooth curve.
Therefore, I think finding a solution to the question below might solve the problem (after filling the gaps in x and removing repeated neighbouring values).
Thanks for your time and help.
