Code covered by the BSD License

5.0 | 25 ratings | 51 Downloads (last 30 days) | File Size: 10.6 KB | File ID: #8354 | Version: 1.0

# Consolidator

### John D'Errico

24 Aug 2005 (Updated)

Consolidates common elements in x (may be n-dimensional), aggregating corresponding y.

### Description

Consolidator has many uses. It was designed to solve an interpolation problem and a Delaunay problem, but I've added other uses too. It can serve as a tool that counts the number of replicates of each point, or simply as an implementation of unique(x,'rows') with a tolerance on that uniqueness.

Interpolation fails when there are replicate x values. Often it is recommended to form the mean of y for the replicate x values, eliminating the reps. Consolidator does this, and allows a tolerance on how close two values of x need to be to be considered replicates. x may have multiple columns, i.e., it works on multi-dimensional data. x may even be a character array.

This same problem is seen both in interp1 and in griddata. Delaunay and delaunayn are also not robust when called with data that has replicates or near replicates.
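A minimal sketch of the interp1 case (the values here are purely illustrative, and the exact error text varies by MATLAB release):

```matlab
% replicate x values break interp1; consolidating first repairs the data
x = [1; 2; 2; 3];
y = [10; 20; 22; 30];
% interp1(x, y, 2.5)   % fails: interp1 requires unique sample points
[xc, yc] = consolidator(x, y, 'mean');  % xc = [1;2;3], yc = [10;21;30]
yi = interp1(xc, yc, 2.5);              % now works; yi = 25.5
```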

Example usages:

% counting replicates
x = round(rand(100000,1)*2);
[xc,yc] = consolidator(x,[],'count');
[xc,yc]
ans =
0 25160
1 49844
2 24996

% aggregate y for the unique elements in x
% y = x(:,1) + x(:,2) + error
x = round(rand(100000,2)*2);
y = sum(x,2)+randn(size(x,1),1);
[xc,yc] = consolidator(x,y,'mean');
[xc,yc]
ans =
0 0 0.0054
0 1.0000 0.9905
0 2.0000 1.9895
1.0000 0 0.9957
1.0000 1.0000 1.9970
1.0000 2.0000 2.9988
2.0000 0 2.0136
2.0000 1.0000 2.9985
2.0000 2.0000 3.9891

Alternate usage using a function handle:
[xc,yc] = consolidator(x,y,@mean);

The aggregation can be of many types: min, max, mean, sum, std, var, median, prod, as well as geometric and harmonic means, plus the simple count option. Use of a function handle allows any aggregation the user may desire.
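For instance, a sketch of a custom aggregation via an anonymous function handle (assuming the calling convention shown above, where the handle receives the grouped y values):

```matlab
% hypothetical example: aggregate each group of y by its range
x = [1; 1; 2; 2; 2];
y = [3; 7; 1; 4; 9];
[xc, yc] = consolidator(x, y, @(v) max(v) - min(v));
% should give xc = [1; 2] and yc = [4; 8]
```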

Consolidator is very different from accumarray. Note that accumarray builds a potentially huge array, filled with zeros. This array cannot be sparse in higher than 2 dimensions. Also, accumarray does not allow a tolerance. Its first argument MUST be an index. Finally, consolidator works on strings too.
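To make the contrast concrete, a small sketch (the values are illustrative):

```matlab
% accumarray needs positive integer subscripts, so noisy values
% must be binned by hand first:
x = [1.001; 0.999; 2.002; 1.998];
y = [10; 12; 20; 22];
idx = round(x);                       % manual index construction
ya  = accumarray(idx, y, [], @mean);  % ya = [11; 21]

% consolidator groups the raw values directly, with a tolerance:
[xc, yc] = consolidator(x, y, 'mean', 0.01);
```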

Acknowledgements

MATLAB release: MATLAB 7.0.1 (R14SP1)
Other requirements: Consolidator requires release 14 (or above) of MATLAB. For users of older MATLAB releases, I've included consolidator13 and consolidator11, which should work on those releases, although I have not tested them there.

31 Jul 2015 Alessandro Masullo

Perfect. Thank you for this work

04 Apr 2015 Sergei Paleichuk

23 Feb 2015 Matthias

Hi John,
great submission!

I have one minor adjustment that would allow for individual tolerances, even if it's a bit ugly:

% consolidate elements of x.
% first shift, scale, and then ceil.
if numel(tol) < size(x,2)
    tol = repmat(tol,1,size(x,2));
end
bgZ = tol > 0;
xhat = x;
if any(bgZ)
    xhat(:,bgZ) = x(:,bgZ) - repmat(min(x(:,bgZ),[],1) + tol(bgZ)*eps, n, 1);
    xhat(:,bgZ) = ceil(bsxfun(@rdivide, xhat(:,bgZ), tol(bgZ)));
end

Hope it helps someone.

30 Dec 2014 Reza Farrahi Moghaddam

04 Jul 2014 Iris Hinrichs

This function is exactly what I was looking for. Thanks for providing it, John!
I just discovered a minor bug:
It happened that I applied consolidator to x = 0.2 and y = [11 6.8].
[xc, yc] = consolidator(x,y, '@nanmean')
xc = 0.2
yc = 11

The last value of y is gone; consolidator somehow "swallowed" it.
Although it does not make sense to consolidate an array that has only one row, the function can end up being applied this way, especially when processing a lot of different arrays automatically.

07 May 2014 Vagner

It just works! Many thanks.

31 Jan 2014 Faraz Oloumi

16 Nov 2013 Will

Sorry John,

After I restarted, everything started working great. Not sure what the problem was, but it doesn't seem to be related to the consolidator function.

Comment only
15 Nov 2013 John D'Errico

Will - Sorry, but you need to be more clear about your problem. I can't guess at the issue. Simplest is to send me the data that has a problem, as consulting in the comments is not my choice.

Comment only
15 Nov 2013 Will

I need some help with this function. It seems to work except for one column of data I'm working with. I tried using 'mean' and @nanmean, and both result in a column filled with only NaNs. There is numeric data present (I can see it in the y variable), and it appears to show up in ycon as 0 until line 258, where:

ycon(count==1,:) = y(ec==1,:)

ycon becomes nothing but NaNs.

Comment only
24 Jul 2013 Tung

It works, but it changes the order of the rows. How can I merge duplicates but still keep the same order?

Thanks

Comment only
24 Apr 2013 Suti

04 Jun 2012 Yavor Kamer

Dear John,

Regarding my previous comment, I found that for that specific test the function performs relatively better if I change line 204
iu = [true;any(diff(xhat),2)];
to
iu = [true;any(abs(diff(xhat))>1,2)];

I also have a hunch that the sortrows (based on the 1st dimension column) on line 199 could be improved to take into account all possible column order permutations. I tried to do it but got into some complications and gave up.

Comment only
04 Jun 2012 Yavor Kamer

Dear John,
Your consolidator function proved to be really indispensable for my Delaunay triangulations. However, when I tried to test it with a set of points perturbed around 5 centers within an uncertainty radius, I couldn't retrieve the initial centers.

unc=0.2;
mat_i=[1 0 0; 1 2 0; 0 3 0; 1 1 0; 2 1 2];
mat_all=mat_i;
for i = 1:100
    mat_all = [mat_all; mat_i + (rand(size(mat_i))-0.5)*unc];
end
mat_c = consolidator(mat_all,[],[],unc);

For one realization the last two rows of mat_c end up being:
1.917 0.900 2.067
2.006 1.001 2.000
which is inconsistent with the tolerance (0.2). Is this an expected result or is there something wrong with my test?

Thank you

Comment only
09 Feb 2012 Ralph Spitzer

Awesome function. Helped to solve my SQL-like "group by" problem. Consolidated my 2 million records in next to no time. Thank you!

Beautiful function. More beautiful when you use it in conjunction with cellfun. Exactly what I was looking for.

MathWorks, please be humble and include this function in MATLAB and pay an appropriate fee to the creator.

Thanks John

17 May 2011 Richard Crozier

Amazing, yet another great piece of code from John D'Errico; it seems like half the code I use will end up being written by him.

06 May 2011 Brennan Smith

Thank you very much! I've been looking all over for a way to identify unique rows and tally the number of repeats, and this is by far the easiest solution - it worked on my first attempt and the outputs were very easy to plot. Great job!

21 Nov 2010 Gerry

I just didn't realize "consolidator" could use other functions as its aggregation mode, in my case nanmedian etc. I have used "consolidator13" and couldn't get around the NaN data with it. It looks like the plain "consolidator" is the only one handling these other functions, and I am sure it will do the trick for me. Thanks.

Comment only
20 Nov 2010 John D'Errico

Well, to some extent, tools like nanmean can help. For example...

x = ceil(5*rand(10,1));
y = rand(10,1);
y(2) = nan;

[xc,yc] = consolidator(x,y,@nanmean)
xc =
1
2
3
4
5
yc =
0.66434
0.36668
0.42507
0.16971
0.54419

If x has nans in it though, things get sticky. Consolidator does not survive nans there. While I could repair this to work for 1-d data, it would still fail for higher dimensions.

Comment only
20 Nov 2010 Gerry

I've been using consolidator with no problems and loving it. But I came across a data set with NaN values and it didn't work. I am getting a bunch of NaN even for the rows with real data. Is there any way around this? Thanks.

12 Jul 2010 John D'Errico

More digging shows that the behavior Christophe finds is a function of rounding, and of floating point arithmetic in general. But it is not something that I can make consolidator robust to, since variations at the least significant bit level will always cause problems in such a code.

This choice of a tolerance made by Christophe forces matlab/consolidator to perform a comparison between floating point numbers. With the tolerance set to exactly the difference between consecutive terms in the set provided, in some cases there MUST be a failure. PLEASE read this document:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

The use of floating point arithmetic in MATLAB causes this to fail. Here, using a version of consolidator with a subtly different internal test, I get the result that Christophe did:

consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1.5
3
3.01
6

Yet now change the tolerance by only an infinitesimal amount, and we can get yet a different set of rounding results.

consolidator([1,2,3,3.01,6]',[],[],1-10*eps)
ans =
1
2.5
3.01
6

consolidator([1,2,3,3.01,6]',[],[],.9999999999999)
ans =
1
2
3.005
6

Again, these differences arise because of floating point arithmetic and the use of a tolerance that is so close to the stride between members of the set. This is not something that I can change, fix, repair, or code in a better way, because if I did make a change then some other set of data would cause the same problems.

I will argue that this is what I call the transitivity problem. When you specify a tolerance of 1, how is consolidator to resolve the set [1 2 3]? Are 1 and 2 to be lumped together? Or 2 and 3? Clearly, each of those pairs are the same to within a tolerance of 1. Yet we cannot lump them all into a single group, because 1 and 3 are not within the specified tolerance. Or should we? We might very logically argue to aggregate them down to any of these sets:

[1, 2, 3]
[1.5, 3]
[1, 2.5]
[1.5, 2.5]
[2]

The point is, beware of tests that compare floating point numbers. And beware of forcing code to make those tests. You can (and will) see virtually random results from doing so.

Finally, avoid use of a tolerance that is so close to the stride between elements of the set to be resolved. Consolidator is not designed to be a clustering tool, but to be a tool that will combine replicate values together and to survive small amounts of noise in the data. The tolerance allows minor variations in the numbers to be thus combined. If you try to use consolidator to cluster numbers together, it might succeed, but you can trip it up. And no matter what, the transitivity problem is important, and is not capable of resolution in an unambiguous manner, for ALL sets of data.
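The order dependence behind this ambiguity can be sketched with a simple greedy pass (this is an illustration only, not consolidator's actual algorithm):

```matlab
% greedy left-to-right grouping of [1 2 3] with tol = 1
v = [1 2 3];
tol = 1;
groups = {};
seed = v(1);
cur = v(1);
for k = 2:numel(v)
    if v(k) - seed <= tol
        cur(end+1) = v(k);       % still within tol of the group seed
    else
        groups{end+1} = cur;     % close the group, start a new one
        seed = v(k);
        cur = v(k);
    end
end
groups{end+1} = cur;
% groups = { [1 2], [3] }, with means [1.5, 3];
% scanning right-to-left instead groups { [3 2], [1] }, with means [2.5, 1]
```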

John

Comment only
06 Jul 2010 John D'Errico

Christophe: My guess is your test used a variable where some of the numbers were not exact integers, so there was some floating point trash involved. This caused the results to be slightly different from what you expect, not the programming of consolidator.

I claim that to be true because when I try the specific example shown, pasted directly into MATLAB, I DO get the expected result. (I don't know what MATLAB release your test was done in, as there can sometimes be release issues too. A different CPU can also sometimes cause subtle differences, although I think that neither release or CPU here are the problem.)

consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1
2
3.005
6

In general, consolidator uses a simple scheme to do the aggregation. This is necessary for speed, and so that it will work efficiently in higher dimensions. Note that there will always be what I'll call the "transitivity" problem. Thus, suppose you wish to perform consolidation on the set [1 1.5 2], with a tolerance of 0.75.

Clearly 1 and 1.5 are within the desired tolerance, so they should be grouped together. But so are 1.5 and 2, so they too should be grouped. Yet 1 and 2 cannot be grouped together.

The point is, there is no scheme which will resolve any possible set of data, aggregating the points into an unambiguously reduced set that all will agree is correct.

Comment only
06 Jul 2010 Christophe Lauwerys

Thanks for this great contribution.
However, unless I misunderstood the functionality, I would expect

consolidator([1,2,3,3.01,6]',[],[],1)

to return

1
2
3.005
6

However, it returns

1.5000
3.0000
3.0100
6.0000

Is this desired behavior? Wouldn't it make sense to aggregate 3 and 3.01 instead of 1 and 2?

Comment only
15 Jul 2009 Michael Krause

05 Feb 2009 Oliver Woodford

This isn't entirely an ACCUMARRAYN (which I agree there definitely needs to be) because the aggregator function must (I believe) return a single value per column of the input matrix. However, ACCUMARRAY has the wonderful property of being able to return a cell array:
C = accumarray(A, B, [], @(x) {x});
I have had cause to use this functionality many times. Any chance you might add it to CONSOLIDATOR, John?

Comment only
18 Sep 2008 Andres T.

Fortunately Loren's blog on accumarray links to here (as 'derivative work')! It's great the author took the time to publish pre-accumarray-versions, too. Thank you!

02 Jul 2008 w s

Great and fast tool that I often use. The only thing I miss is the ability to apply different tolerances to different columns of x. That would be great.

21 Apr 2008 chen li

The following is what I use to consolidate two lists and, at the same time, remove outliers in ylist. However, it calls consolidator three times.
Anyone have a better idea?
**********************************
[xg, meany, Ind] = consolidator(xlist, ylist, 'mean');
[xg, stdy, Ind] = consolidator(xlist, ylist, 'std');

notoutlier = find(abs(ylist - meany(Ind)) < 3*stdy(Ind));
xlist = xlist(notoutlier);
ylist = ylist(notoutlier);
[xg, yg, Ind] = consolidator(xlist, ylist);

18 Oct 2007 Ronald Clinton

Great and fast tool I've been using for a while. But as for "2007-09-08 Provided count information as a 4th output", the changed version does not seem to have been uploaded (18 October 2007).

Comment only
07 Sep 2007 Sergei Koulayev

It would be nice if the program would report how many elements fall into each cluster...

09 May 2007 Lai Mun Woo

I've found this enormously handy to use. Excellent quick fix routine. Thank you for making it available.

21 Oct 2006 gabriel asaftei
18 Jan 2006 John D'Errico

A.L. - I've uploaded a new release of consolidator, fixing several other minor problems too, as noted in the change history. When MATLAB Central recognizes the new release in a few hours, please verify that consolidator13 now runs properly, as I cannot test it below R14. Thank you for identifying the problem. I'm sorry about the inconvenience.

Comment only
18 Jan 2006 A. L.

Possible fix for previous comment (limited testing):

Replace line 201:

count=accumarray(eb,1).';

with:

count = diff(find([iu; true])).';

Comment only
18 Jan 2006 A. L.

The R13 version uses accumarray, which I don't think was available until R14 (I may be wrong), which is rather disappointing if you wanted to use consolidator to add accumarray functionality to an older release.

Comment only
03 Jan 2006 Iram Weinstein

This is a really useful function. However, when the aggregation option is 'count', I find that Duane Hanselman's mmrepeat is much faster.

23 Nov 2005 Robert Halter

How is this different from accumarray?

Comment only
15 Nov 2005 Liang Jin

This is exactly what I am looking for!
The hist() in MATLAB is too limited in functionality.

14 Nov 2005 Michael Ebstyne

Much needed addition to MATLAB functionality! For those coming from the SQL world, used to doing massive aggregations and wildly complex rolling of data sets in simple SQL statements, you've probably been looking for this. One suggestion: it would be killer to tackle multiple aggregate types across multiple columns.

30 Oct 2005 Evan Weller

I agree with Urs. This would be an excellent inclusion in future releases of MATLAB.

Exactly what I needed for my work.

31 Aug 2005 urs (us) schwarz

wow, what an (almost) flawlessly coded snippet of long-awaited code! it's too bad, however, that there are two minuscule issues with it:

- the help section is TOO wordy (almost a novel by itself) and MUST be streamlined to the very essential, bare bone

- the name CONSOLIDATOR is distracting (and to most people rather obfuscating) and (really!) should be changed to ACCUMARRAYN, which is what it really does: extend the functionality of this otherwise great addition to the ML family of pre-packaged functions (just consider how easily it preprocesses data for the statistics tbx's family of ANOVAs!)

altogether, this code is so essential one might even ask the dear people at TMW to include it (maybe even in mexed form) in one of the future releases
us

30 Aug 2005

The newer code has been sped up, plus several minor bugs are fixed.

Many thanks are due to Urs Schwarz for his aid in debugging drafts of my code and suggesting alternatives in the code as well as the interface.

11 Oct 2005

It now works on (rectangular) character arrays.

23 Nov 2005

Documentation change

18 Jan 2006

1. Replaced use of accumarray for consolidator13.
2. Replaced a round with ceil to improve the clustering behavior of consolidator near the endpoints.
3. Allowed the user to supply row vectors.
4. Fixed a bug that caused failure when x is a scalar.

02 May 2006

1. Comments about converting complex x to its real and imaginary parts.
2. Fix bug with tolerance when reps are within the tolerance level at the minimum element.
3. Added name, e-mail, etc. to the code.