Consolidates common elements in x (may be n-dimensional), aggregating corresponding y.
Consolidator has many uses. It was designed to solve an interpolation problem and a Delaunay problem, but I've added other uses too. It can serve as a tool which counts the number of replicates of each point, or as simply an implementation of unique(x,'rows'), but with a tolerance on that uniqueness.
Interpolation fails when there are replicate x values. Often it is recommended to form the mean of y for the replicate x values, eliminating the reps. Consolidator does this, and allows a tolerance on how close two values of x need be to be considered replicates. x may have multiple columns, i.e., it works on multidimensional data. x may even be a character array.
This same problem is seen both in interp1 and in griddata. Delaunay and delaunayn are also not robust when called with data that has replicates or near replicates.
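A minimal sketch of the mean-of-replicates step before interpolation, using only the built-in unique and accumarray rather than consolidator itself (so only exact replicates are merged, with no tolerance). The data values are made up for illustration.

```matlab
% Made-up data with replicate x values (interp1 needs distinct sample points)
x = [0; 1; 1; 2; 3; 3; 3];
y = [0; 0.9; 1.1; 2; 2.8; 3.0; 3.2];

% Group exact replicates: xu is sorted, idx maps each element of x to its group
[xu, ~, idx] = unique(x);

% Average y within each group of replicates
ym = accumarray(idx, y, [], @mean);

% The sample points are now distinct, so interpolation can proceed
yi = interp1(xu, ym, 2.5);
```

When near replicates must also be merged, a call of the form consolidator(x,y,'mean',tol) would take the place of the unique/accumarray pair.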
Example usages:
% counting replicates
x = round(rand(100000,1)*2);
[xc,yc] = consolidator(x,[],'count');
[xc,yc]
ans =
0 25160
1 49844
2 24996
% aggregate y for the unique elements in x
% y = x(:,1) + x(:,2) + error
x = round(rand(100000,2)*2);
y = sum(x,2)+randn(size(x,1),1);
[xc,yc] = consolidator(x,y,'mean');
[xc,yc]
ans =
0 0 0.0054
0 1.0000 0.9905
0 2.0000 1.9895
1.0000 0 0.9957
1.0000 1.0000 1.9970
1.0000 2.0000 2.9988
2.0000 0 2.0136
2.0000 1.0000 2.9985
2.0000 2.0000 3.9891
Alternate usage using a function handle:
[xc,yc] = consolidator(x,y,@mean);
Many types of aggregation are available: min, max, mean, sum, std, var, median, prod, as well as geometric and harmonic means, plus the simple count option. Use of a function handle allows for any aggregation the user may desire.
Consolidator is very different from accumarray. Note that accumarray builds a potentially huge array, filled with zeros. This array cannot be sparse in higher than 2 dimensions. Also, accumarray does not allow a tolerance. Its first argument MUST be an index. Finally, consolidator works on strings too.
1. Comments about converting complex x to its real and imaginary parts.


1. Replaced use of accumarray for consolidator13.


Documentation change 

It now works on (rectangular) character arrays. 

The newer code has been sped up, plus several
Many thanks are due to Urs Schwarz for his aid in

Inspired: Patch Slim (patchslim.m), CoBlade: Software for Analysis and Design of Composite Blades
Alessandro Masullo
Perfect. Thank you for this work
Sergei P.
Matthias
Hi John,
great submission!
I have one minor adjustment that would allow for individual tolerances, even if it's a bit ugly:
% consolidate elements of x.
% first shift, scale, and then ceil.
if numel(tol) < size(x,2)
tol = repmat(tol,1,size(x,2));
end
bgZ = tol>0;
xhat = x;
if any(bgZ)
xhat(:,bgZ) = x(:,bgZ) - repmat(min(x(:,bgZ),[],1)+tol(bgZ)*eps,n,1);
xhat(:,bgZ) = ceil(bsxfun(@rdivide,xhat(:,bgZ),tol(bgZ)));
end
Hope it helps someone.
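A self-contained sketch of the same shift-and-scale idea with one tolerance per column, assuming made-up data and using floor on a grid anchored at each column's minimum instead of the ceil-with-offset step in the snippet above. It illustrates the binning only and is not consolidator's own code.

```matlab
% Made-up data: two columns with very different noise levels
x   = [0.00 10.0; 0.01 10.4; 1.05 20.0; 1.07 20.3];
tol = [0.1 1];                 % hypothetical per-column tolerances
n   = size(x, 1);

% Shift each column so that its minimum is zero ...
xhat = x - repmat(min(x, [], 1), n, 1);

% ... then scale by that column's tolerance and round down to a bin index
xhat = floor(xhat ./ repmat(tol, n, 1));

% Rows that share all bin indices belong to the same group
[~, ~, grp] = unique(xhat, 'rows');
```

Here rows 1 and 2 land in one group and rows 3 and 4 in another, even though the second column varies far more than the first in absolute terms.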
Reza Farrahi Moghaddam
Iris Hinrichs
This function is exactly what I was looking for. Thanks for providing it, John!
I just discovered a minor bug:
It happened that I applied consolidator to x = 0.2 and y = [11 6.8].
[xc, yc] = consolidator(x,y, '@nanmean')
xc = 0.2
yc = 11
The last value of y is gone; the consolidator somehow "swallowed" it.
Although it does not make sense to consolidate an array that only has one row, the application of this function in this way can happen, especially when processing a lot of different arrays automatically.
Vagner
It just works! Many thanks.
Faraz Oloumi
Will
Sorry John,
After I restarted everything started working great. Not sure what the problem was but doesn't seem to be related to the consolidator function.
Thanks for your attention
John D'Errico
Will: Sorry, but you need to be more clear about your problem. I can't guess at the issue. Simplest is to send me the data that has a problem, as consulting in the comments is not my choice.
Will
I need some help with this function. It seems to be working except for one column of data I'm working with. I tried using 'mean' and @nanmean, and both result in a column filled with only NaNs. There is numeric data present, I can see it in the y variable, and it appears to show up in ycon as a 0 until line 258 where:
ycon(count==1,:) = y(ec==1,:)
ycon becomes nothing but NaNs
Tung
It works, but it changes the order of rows. How can I merge duplicates but still keep the same order?
Thanks
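One possible way to merge exact duplicates while keeping the order of first appearance is the 'stable' option of the built-in unique. This is a sketch with made-up data, it applies no tolerance, and it does not go through consolidator.

```matlab
% Made-up data: x contains duplicates in no particular order
x = [3; 1; 3; 2; 1];
y = [10; 20; 30; 40; 50];

% 'stable' keeps groups in order of first appearance instead of sorting
[xs, ~, idx] = unique(x, 'stable');

% Average y within each group
ys = accumarray(idx, y, [], @mean);
```

The groups come back as 3, 1, 2, matching the order in which those values first occur in x.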
Suti
Yavor Kamer
Dear John,
Regarding my previous comment, I found out that for that specific test the function performs relatively better if I change line 204
iu = [true;any(diff(xhat),2)];
to
iu = [true;any(abs(diff(xhat))>1,2)];
I also have a hunch that the sortrows (based on the 1st dimension column) on line 199 could be improved to take into account all possible column order permutations. I tried to do it but got into some complications and gave up.
Yavor Kamer
Dear John,
Your consolidator function proved to be really indispensable for my Delaunay triangulations. However when I tried to test it with a set of points perturbed around 5 centers within an uncertainty radius I couldn't retrieve the initial centers.
unc=0.2;
mat_i=[1 0 0; 1 2 0; 0 3 0; 1 1 0; 2 1 2];
mat_all=mat_i;
for i=1:100
mat_all=[mat_all; mat_i+(rand(size(mat_i))-0.5)*unc];
end
mat_c = consolidator(mat_all,[],[],unc);
For one realization the last two rows of mat_c end up to be:
1.917 0.900 2.067
2.006 1.001 2.000
which is inconsistent with the tolerance (0.2). Is this an expected result or is there something wrong with my test?
Thank you
Ralph Spitzer
Awesome function. Helped to solve my SQL-like "group by" problem. Consolidated my 2 million records in next to no time. Thank you!
ade77
Beautiful function. More beautiful when you use it in conjunction with cellfun. Exactly what I was looking for.
Mathworks, please be humble and include this function in MATLAB and pay appropriate fee for the creator.
Thanks John
Richard Crozier
Amazing, yet another great code from John D'Errico, it seems like half the code I use will end up being written by him.
Brennan Smith
Thank you very much! I've been looking all over for a way to identify unique rows and tally the number of repeats, and this is by far the easiest solution: it worked on my first attempt and the outputs were very easy to plot. Great job!
Gerry
I just didn't realize "consolidator" can use other functions as its aggregation mode, in my case nanmedian etc. I have used "consolidator13" and couldn't get around the NaN data with it. Looks like the plain "consolidator" is the only one handling these other functions and I am sure it will do the trick for me. Thanks.
John D'Errico
Well, to some extent, tools like nanmean can help. For example...
x = ceil(5*rand(10,1));
y = rand(10,1);
y(2) = nan;
[xc,yc] = consolidator(x,y,@nanmean)
xc =
1
2
3
4
5
yc =
0.66434
0.36668
0.42507
0.16971
0.54419
If x has nans in it though, things get sticky. Consolidator does not survive nans there. While I could repair this to work for 1d data, it would still fail for higher dimensions.
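One possible workaround, assuming it is acceptable to discard incomplete rows, is to drop every row of x that contains a NaN before consolidating. The sketch below uses made-up data and the built-in unique and accumarray for the grouping step, so it shows the filtering idea rather than consolidator's behavior.

```matlab
% Made-up data: the second row of x contains a NaN
x = [1 2; NaN 3; 1 2; 4 5];
y = [10; 20; 30; 40];

% Keep only rows of x that are free of NaNs, and the matching rows of y
ok = ~any(isnan(x), 2);
xk = x(ok, :);
yk = y(ok, :);

% Group the remaining rows and average y within each group
[xu, ~, idx] = unique(xk, 'rows');
ym = accumarray(idx, yk, [], @mean);
```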
Gerry
Please Help ...
I've been using consolidator with no problems and loving it. But I came across a data set with NaN values and it didn't work. I am getting a bunch of NaN even for the rows with real data. Is there any way around this? Thanks.
John D'Errico
More digging shows that the behavior Christophe finds is a function of rounding, and of floating point arithmetic in general. But it is not something that I can make consolidator robust to, since variations at the least significant bit level will always cause problems in such a code.
This choice of a tolerance made by Christophe forces matlab/consolidator to perform a comparison between floating point numbers. With the tolerance set to exactly the difference between consecutive terms in the set provided, in some cases there MUST be a failure. PLEASE read this document:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
The use of floating point arithmetic in MATLAB causes this to fail. Here, using a version of consolidator with a subtly different internal test, I get the result that Christophe did:
consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1.5
3
3.01
6
Yet now change the tolerance by only an infinitesimal amount, and we can get yet a different set of rounding results.
consolidator([1,2,3,3.01,6]',[],[],1-10*eps)
ans =
1
2.5
3.01
6
consolidator([1,2,3,3.01,6]',[],[],.9999999999999)
ans =
1
2
3.005
6
Again, these differences arise because of floating point arithmetic and the use of a tolerance that is so close to the stride between members of the set. This is not something that I can change, fix, repair, or code in a better way, because if I did make a change then some other set of data would cause the same problems.
I will argue that this is what I call the transitivity problem. When you specify a tolerance of 1, how is consolidator to resolve the set [1 2 3]? Are 1 and 2 to be lumped together? Or 2 and 3? Clearly, each of those pairs is the same to within a tolerance of 1. Yet we cannot lump them all into a single group, because 1 and 3 are not within the specified tolerance. Or should we? We might very logically argue to aggregate them down to any of these sets:
[1, 2, 3]
[1.5, 3]
[1, 2.5]
[1.5, 2.5]
[2]
The point is, beware of tests that compare floating point numbers. And beware of forcing code to make those tests. You can (and will) see virtually random results from doing so.
Finally, avoid use of a tolerance that is so close to the stride between elements of the set to be resolved. Consolidator is not designed to be a clustering tool, but to be a tool that will combine replicate values together and to survive small amounts of noise in the data. The tolerance allows minor variations in the numbers to be thus combined. If you try to use consolidator to cluster numbers together, it might succeed, but you can trip it up. And no matter what, the transitivity problem is important, and is not capable of resolution in an unambiguous manner, for ALL sets of data.
John
John D'Errico
Christophe: My guess is your test used a variable where some of the numbers were not exact integers, so there was some floating point trash involved. This caused the results to be slightly different from what you expect, not the programming of consolidator.
I claim that to be true because when I try the specific example shown, pasted directly into MATLAB, I DO get the expected result. (I don't know what MATLAB release your test was done in, as there can sometimes be release issues too. A different CPU can also sometimes cause subtle differences, although I think that neither release nor CPU is the problem here.)
consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1
2
3.005
6
In general, consolidator uses a simple scheme to do the aggregation. This is necessary for speed, and so that it will work efficiently in higher dimensions. Note that there will always be what I'll call the "transitivity" problem. Thus, suppose you wish to perform consolidation on the set [1 1.5 2], with a tolerance of 0.75.
Clearly 1 and 1.5 are within the desired tolerance, so they should be grouped together. But so are 1.5 and 2, so they too should be grouped. Yet 1 and 2 cannot be grouped together.
The point is, there is no scheme which will resolve any possible set of data, aggregating the points into an unambiguously reduced set that all will agree is correct.
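The ambiguity can be seen with a simple fixed-grid binning of the set [1 1.5 2] at a tolerance of 0.75. The grid origins below are arbitrary choices made for illustration, and this binning is an assumed stand-in, not necessarily the scheme consolidator uses internally.

```matlab
x   = [1; 1.5; 2];
tol = 0.75;

% Bin index on a grid of width tol anchored at 1.00
binsA = floor((x - 1.00) / tol);   % groups {1, 1.5} and {2}

% Same data and tolerance, grid anchored at 0.65 instead
binsB = floor((x - 0.65) / tol);   % groups {1} and {1.5, 2}
```

Both groupings respect the tolerance, yet they disagree about which neighbor 1.5 belongs with, so neither can be called the one correct answer.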
Christophe Lauwerys
Thanks for this great contribution.
However, unless I misunderstood the functionality, I would expect
consolidator([1,2,3,3.01,6]',[],[],1)
to return
1
2
3.005
6
However, it returns
1.5000
3.0000
3.0100
6.0000
Is this desired behavior? Wouldn't it make sense to aggregate 3 and 3.01 instead of 1 and 2?
Michael Krause
Oliver Woodford
This isn't entirely an ACCUMARRAYN (which I agree there definitely needs to be) because the aggregator function must (I believe) return a single value per column of the input matrix. However, ACCUMARRAY has the wonderful property of being able to return a cell array:
C = accumarray(A, B, [], @(x) {x});
I have had cause to use this functionality many times. Any chance you might add it to CONSOLIDATOR, John?
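For readers unfamiliar with that accumarray idiom, here is a small sketch with made-up subscripts and values. Wrapping the group in braces makes the function return a cell, so each output element holds every value of its group rather than a single aggregate.

```matlab
% Made-up group indices and values
subs = [1; 2; 1; 3];
vals = [10; 20; 30; 40];

% Each output cell collects all values sharing a subscript
C = accumarray(subs, vals, [], @(v) {v});

% The order of values inside a cell is not guaranteed, so sort before comparing
g1 = sort(C{1});
```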
Fortunately Loren's blog on accumarray links to here (as 'derivative work')! It's great the author took the time to publish pre-accumarray versions, too. Thank you!
Great and fast tool that I often use. The only thing I miss is the ability to apply different tolerances to different columns of x. That would be great.
The following is what I use to consolidate two lists and, at the same time, remove outliers in the YList. However, it calls consolidator three times.
Does anyone have a better idea?
**********************************
[xg, meany, Ind] = consolidator(xlist, ylist, 'mean');
[xg, stdy, Ind] = consolidator(xlist,ylist,'std');
notoutlier = find(abs(ylist-meany(Ind)) < 3*stdy(Ind));
xlist = xlist(notoutlier);
ylist = ylist(notoutlier);
[xg, yg, Ind] = consolidator(xlist,ylist);
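A possible alternative, sketched with built-in functions only and made-up data, computes the grouping once and reuses the group index for both the mean and the standard deviation. It merges exact replicates only, so it is not a drop-in replacement when a tolerance is needed.

```matlab
% Made-up data: group 1 holds twenty values of 1 plus one gross outlier
xlist = [ones(21,1); 2*ones(5,1)];
ylist = [ones(20,1); 100; 1.9; 2.0; 2.1; 2.0; 2.0];

% Group once; idx maps every element of xlist to its group
[xg, ~, idx] = unique(xlist);

% Per-group mean and standard deviation from the same index
mu = accumarray(idx, ylist, [], @mean);
sd = accumarray(idx, ylist, [], @std);

% Keep points within 3 standard deviations of their own group mean
keep = abs(ylist - mu(idx)) < 3*sd(idx);

% Aggregate again on the cleaned data
[xg2, ~, idx2] = unique(xlist(keep));
yg = accumarray(idx2, ylist(keep), [], @mean);
```

Note that a group whose values are all identical has zero standard deviation, so the strict inequality would reject every point in it; a guard for that case may be needed on real data.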
Great and fast tool I've been using for a while. But as for "2007-09-08 Provided count information as a 4th output", the changed version seems not to have been uploaded (18 October 2007).
It would be nice if the program would report how many elements fall into each cluster...
I've found this enormously handy to use. Excellent quick fix routine. Thank you for making it available.
A.L.: I've uploaded a new release of consolidator, fixing several other minor problems too as noted in the change history. When Matlab Central recognizes the new release in a few hours, please verify that consolidator13 now runs properly, as I cannot test it below R14. Thank you for identifying the problem. I'm sorry about the inconvenience.
Possible fix for previous comment (limited testing):
Replace line 201:
count=accumarray(eb,1).';
with:
count = diff(find([iu; true])).';
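To see why that replacement counts replicates, here is a small sketch on made-up, already sorted data. It assumes iu is a logical column marking the first element of each run of equal values, which is how the variable appears to be used in the lines quoted in these comments.

```matlab
% Made-up sorted data with runs of length 2, 1 and 3
xs = [1; 1; 2; 5; 5; 5];

% iu marks the first element of each run of equal values
iu = [true; diff(xs) ~= 0];

% Appending a sentinel 'true' marks the position just past the last run,
% so the gaps between successive marks are the run lengths
count = diff(find([iu; true])).';
```

The result is the row vector [2 1 3], the same counts accumarray would give, without needing accumarray at all.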
The R13 version uses accumarray, which I don't think was available until R14 (I may be wrong), which is rather disappointing if you wanted to use consolidator to add accumarray functionality to an older release.
This is a really useful function. However, when the aggregation option is 'count', I find that Duane Hanselman's mmrepeat is much faster.
How is this different from accumarray?
This is exactly what I am looking for!
The hist() in MATLAB is too limited in functionality.
Much needed addition to MATLAB functionality! For those coming from the SQL world, used to doing massive aggregations and wildly complex rolling of data sets in simple SQL statements, you've probably been looking for this. One suggestion... it would be killer to tackle multiple aggregate types across multiple columns.
I agree with Urs. Would be an excellent inclusion into future releases of Matlab.
Exactly what I needed for my work.
wow, what an (almost) flawlessly coded snippet of long-awaited code! it's too bad, however, that there are two minuscule issues with it:
- the help section is TOO wordy (almost a novel by itself) and MUST be streamlined to the very essential, bare bones
- the name CONSOLIDATOR is distracting (and to most people rather obfuscating) and (really!) should be changed to ACCUMARRAYN, which is what it really does: extend the functionality of this otherwise great addition to the ML family of prepackaged functions (just consider how easily it preprocesses data for the statistics tbx's family of ANOVAs!)
altogether, this code is so essential one might even ask the dear people at TMW to include it (maybe even in mexed form) in one of the future releases
us