Code covered by the BSD License  

5.0 | 19 ratings | 77 Downloads (last 30 days) | File Size: 10.6 KB | File ID: #8354

Consolidator

by John D'Errico

24 Aug 2005 (Updated )

Consolidates common elements in x (may be n-dimensional), aggregating corresponding y.


File Information
Description

Consolidator has many uses. It was designed to solve an interpolation problem and a Delaunay problem, but I've added other uses too. It can serve as a tool which counts the number of replicates of each point, or simply as an implementation of unique(x,'rows'), but with a tolerance on that uniqueness.

Interpolation fails when there are replicate x values. Often it is recommended to form the mean of y for the replicate x values, eliminating the reps. Consolidator does this, and allows a tolerance on how close two values of x need be to be considered replicates. x may have multiple columns, i.e., it works on multi-dimensional data. x may even be a character array.

This same problem is seen both in interp1 and in griddata. Delaunay and delaunayn are also not robust when called with data that has replicates or near replicates.
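The averaging step described above can be sketched with built-in functions only. This is a minimal illustration, not consolidator's own code: it handles exact replicates only (no tolerance), and the data values are made up for the example.

```matlab
% Replicate x values make interp1 fail, so average y over each unique x first.
x = [1; 2; 2; 3; 4; 4; 4];
y = [10; 19; 21; 30; 39; 40; 41];

% ux: sorted unique x values; idx maps every original row to its entry in ux
[ux, ~, idx] = unique(x);
ymean = accumarray(idx(:), y, [], @mean);   % mean of y within each group

% ux is now strictly increasing, so interpolation is well defined
yi = interp1(ux, ymean, 2.5);
```

With consolidator the same reduction would be a single call of the form shown in the example usages, with the added option of a tolerance on x.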

Example usages:

% counting replicates
x = round(rand(100000,1)*2);
[xc,yc] = consolidator(x,[],'count');
[xc,yc]
ans =
           0 25160
           1 49844
           2 24996

% aggregate y for the unique elements in x
% y = x(:,1) + x(:,2) + error
x = round(rand(100000,2)*2);
y = sum(x,2)+randn(size(x,1),1);
[xc,yc] = consolidator(x,y,'mean');
[xc,yc]
ans =
         0 0 0.0054
         0 1.0000 0.9905
         0 2.0000 1.9895
    1.0000 0 0.9957
    1.0000 1.0000 1.9970
    1.0000 2.0000 2.9988
    2.0000 0 2.0136
    2.0000 1.0000 2.9985
    2.0000 2.0000 3.9891

Alternate usage using a function handle:
[xc,yc] = consolidator(x,y,@mean);

The aggregation can be of many types: min, max, mean, sum, std, var, median, prod, as well as geometric and harmonic means, plus the simple count option. Use of a function handle allows for any aggregation the user may desire.

Consolidator is very different from accumarray. Note that accumarray builds a potentially huge array, filled with zeros. This array cannot be sparse in higher than 2 dimensions. Also, accumarray does not allow a tolerance. Its first argument MUST be an index. Finally, consolidator works on strings too.
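To make the contrast with accumarray concrete, here is a small sketch using only built-in functions. The data are invented for illustration, and the second half is just one way to get a compact result from accumarray; it is not how consolidator is implemented.

```matlab
x = [0 1; 2 2; 0 1; 2 2; 1 0];   % two-column "coordinates", with replicates
y = [1; 2; 3; 4; 5];

% accumarray needs positive integer subscripts. Using x+1 directly
% produces a full 3-by-3 grid in which most cells are zero.
A = accumarray(x + 1, y);

% Mapping each row to a group number first keeps one entry per distinct row.
[ux, ~, idx] = unique(x, 'rows');
s = accumarray(idx(:), y);       % sum of y for each row of ux
```

Here A has nine cells for three distinct points, and the grid grows with the range of x in every dimension, whereas ux and s stay at one row per distinct point. Neither form accepts a tolerance on x.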

Acknowledgements

This file inspired Co Blade: Software For Analysis And Design Of Composite Blades and Patch Slim (Patchslim.M).

MATLAB release: MATLAB 7.0.1 (R14SP1)
Other requirements: Consolidator requires release 14 (or above) of MATLAB. For users of older MATLAB releases, I've included consolidator13 and consolidator11, which should work on older releases, although I have not tested them there.
Comments and Ratings (36)
31 Jan 2014 Faraz Oloumi  
16 Nov 2013 Will

Sorry John,

After I restarted everything started working great. Not sure what the problem was but doesn't seem to be related to the consolidator function.

Thanks for your attention

15 Nov 2013 John D'Errico

Will - Sorry, but you need to be more clear about your problem. I can't guess at the issue. Simplest is to send me the data that has a problem, as consulting in the comments is not my choice.

15 Nov 2013 Will

I need some help with this function. It seems to be working except for one column of data I'm working with. I tried using 'mean' and @nanmean, and both result in a column filled with only NaNs. There is numeric data present, I can see it in the y variable, and it appears to show up in ycon as a 0 until line 258 where:

ycon(count==1,:) = y(ec==1,:)

ycon becomes nothing but NaNs

24 Jul 2013 Tung

It works, but it changes the order of rows. How can I merge duplicates but still keep the same order?

Thanks

24 Apr 2013 Suti  
04 Jun 2012 Yavor Kamer

Dear John,

Regarding my previous comment, I found out that for that specific test the function performs relatively better if I change line 204
iu = [true;any(diff(xhat),2)];
to
iu = [true;any(abs(diff(xhat))>1,2)];

I also have a hunch that the sortrows (based on the 1st dimension column) on line 199 could be improved to take into account all possible column order permutations. I tried to do it but got into some complications and gave up.

04 Jun 2012 Yavor Kamer

Dear John,
Your consolidator function proved to be really indispensable for my Delaunay triangulations. However when I tried to test it with a set of points perturbed around 5 centers within an uncertainty radius I couldn't retrieve the initial centers.

unc=0.2;
mat_i=[1 0 0; 1 2 0; 0 3 0; 1 1 0; 2 1 2];
mat_all=mat_i;
for i=1:100
mat_all=[mat_all; mat_i+(rand(size(mat_i))-0.5)*unc;];
end
mat_c = consolidator(mat_all,[],[],unc);

For one realization the last two rows of mat_c end up to be:
1.917 0.900 2.067
2.006 1.001 2.000
which is inconsistent with the tolerance (0.2). Is this an expected result or is there something wrong with my test?

Thank you

09 Feb 2012 Ralph Spitzer

Awesome function. Helped to solve my SQL-like "group by" problem. Consolidated my 2 million records in next to no time. Thank you!

30 Sep 2011 ade77

Beautiful function. More beautiful when you use it in conjunction with cellfun. Exactly what I was looking for.

Mathworks, please be humble and include this function in MATLAB and pay appropriate fee for the creator.

Thanks John

17 May 2011 Richard Crozier

Amazing, yet another great code from John D'Errico, it seems like half the code I use will end up being written by him.

06 May 2011 Brennan Smith

Thank you very much! I've been looking all over for a way to identify unique rows and tally the number of repeats, and this is by far the easiest solution - it worked on my first attempt and the outputs were very easy to plot. Great job!

21 Nov 2010 Gerry

I just didn't realize "consolidator" can use other functions as its aggregation mode, in my case nanmedian etc. I have used "consolidator13" and couldn't get around the NaN data with it. Looks like the plain "consolidator" is the only one handling these other functions and I am sure it will do the trick for me. Thanks.

20 Nov 2010 John D'Errico

Well, to some extent, tools like nanmean can help. For example...

x = ceil(5*rand(10,1));
y = rand(10,1);
y(2) = nan;

[xc,yc] = consolidator(x,y,@nanmean)
xc =
1
2
3
4
5
yc =
0.66434
0.36668
0.42507
0.16971
0.54419

If x has nans in it though, things get sticky. Consolidator does not survive nans there. While I could repair this to work for 1-d data, it would still fail for higher dimensions.

20 Nov 2010 Gerry

Please Help ...
I've been using consolidator with no problems and loving it. But I came across a data set with NaN values and it didn't work. I am getting a bunch of NaN even for the rows with real data. Is there any way around this? Thanks.

12 Jul 2010 John D'Errico

More digging shows that the behavior Christophe finds is a function of rounding, and of floating point arithmetic in general. But it is not something that I can make consolidator robust to, since variations at the least significant bit level will always cause problems in such a code.

This choice of a tolerance made by Christophe forces matlab/consolidator to perform a comparison between floating point numbers. With the tolerance set to exactly the difference between consecutive terms in the set provided, in some cases there MUST be a failure. PLEASE read this document:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

The use of floating point arithmetic in MATLAB causes this to fail. Here, using a version of consolidator with a subtly different internal test, I get the result that Christophe did:

consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1.5
3
3.01
6

Yet now change the tolerance by only an infinitesimal amount, and we can get yet a different set of rounding results.

consolidator([1,2,3,3.01,6]',[],[],1-10*eps)
ans =
1
2.5
3.01
6

consolidator([1,2,3,3.01,6]',[],[],.9999999999999)
ans =
1
2
3.005
6

Again, these differences arise because of floating point arithmetic and the use of a tolerance that is so close to the stride between members of the set. This is not something that I can change, fix, repair, or code in a better way, because if I did make a change then some other set of data would cause the same problems.

I will argue that this is what I call the transitivity problem. When you specify a tolerance of 1, how is consolidator to resolve the set [1 2 3]? Are 1 and 2 to be lumped together? Or 2 and 3? Clearly, each of those pairs is the same to within a tolerance of 1. Yet we cannot lump them all into a single group, because 1 and 3 are not within the specified tolerance. Or should we? We might very logically argue to aggregate them down to any of these sets:

[1, 2, 3]
[1.5, 3]
[1, 2.5]
[1.5, 2.5]
[2]

The point is, beware of tests that compare floating point numbers. And beware of forcing code to make those tests. You can (and will) see virtually random results from doing so.
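A small sketch of this effect, independent of consolidator: when the tolerance equals the nominal stride, rounding decides which neighbouring pairs pass the test. The values below are chosen purely for illustration.

```matlab
tol = 0.1;
x = (0:10).' * 0.1;          % nominal stride equals the tolerance
d = diff(x);                 % nominally all 0.1, but each gap is rounded differently
withinTol = d <= tol;        % some gaps pass, others miss by a rounding error

% true when the outcome is inconsistent across nominally identical gaps
mixed = any(withinTol) && ~all(withinTol);
```

For example, the first gap is exactly 0.1, while the gap between 3*0.1 and 2*0.1 comes out slightly larger than 0.1 in double precision, so the two are classified differently.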

Finally, avoid use of a tolerance that is so close to the stride between elements of the set to be resolved. Consolidator is not designed to be a clustering tool, but to be a tool that will combine replicate values together and to survive small amounts of noise in the data. The tolerance allows minor variations in the numbers to be thus combined. If you try to use consolidator to cluster numbers together, it might succeed, but you can trip it up. And no matter what, the transitivity problem is important, and is not capable of resolution in an unambiguous manner, for ALL sets of data.

John

06 Jul 2010 John D'Errico

Christophe: My guess is your test used a variable where some of the numbers were not exact integers, so there was some floating point trash involved. This caused the results to be slightly different from what you expect, not the programming of consolidator.

I claim that to be true because when I try the specific example shown, pasted directly into MATLAB, I DO get the expected result. (I don't know what MATLAB release your test was done in, as there can sometimes be release issues too. A different CPU can also sometimes cause subtle differences, although I think that neither the release nor the CPU is the problem here.)

consolidator([1,2,3,3.01,6]',[],[],1)
ans =
1
2
3.005
6

In general, consolidator uses a simple scheme to do the aggregation. This is necessary for speed, and so that it will work efficiently in higher dimensions. Note that there will always be what I'll call the "transitivity" problem. Thus, suppose you wish to perform consolidation on the set [1 1.5 2], with a tolerance of 0.75.

Clearly 1 and 1.5 are within the desired tolerance, so they should be grouped together. But so are 1.5 and 2, so they too should be grouped. Yet 1 and 2 cannot be grouped together.

The point is, there is no scheme which will resolve any possible set of data, aggregating the points into an unambiguously reduced set that all will agree is correct.
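The [1 1.5 2] case with a tolerance of 0.75 can be checked directly. The sketch below only builds the pairwise "within tolerance" relation; it does not reproduce consolidator's grouping scheme, and the variable names are illustrative.

```matlab
x = [1; 1.5; 2];
tol = 0.75;

% D(i,j) = |x(i) - x(j)| for every pair of points
D = abs(bsxfun(@minus, x, x.'));
isNear = D <= tol;

% 1 ~ 1.5 and 1.5 ~ 2 both hold, but 1 ~ 2 does not: the relation is not transitive
notTransitive = isNear(1,2) && isNear(2,3) && ~isNear(1,3);
```

Because the relation is not transitive, it does not partition the points into groups on its own, so any grouping rule has to break the tie in some arbitrary way.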

06 Jul 2010 Christophe Lauwerys

Thanks for this great contribution.
However, unless I misunderstood the functionality, I would expect

consolidator([1,2,3,3.01,6]',[],[],1)

to return

1
2
3.005
6

However, it returns

1.5000
3.0000
3.0100
6.0000

Is this desired behavior? Wouldn't it make sense to aggregate 3 and 3.01 instead of 1 and 2?

15 Jul 2009 Michael Krause  
05 Feb 2009 Oliver Woodford

This isn't entirely an ACCUMARRAYN (which I agree there definitely needs to be) because the aggregator function must (I believe) return a single value per column of the input matrix. However, ACCUMARRAY has the wonderful property of being able to return a cell array:
C = accumarray(A, B, [], @(x) {x});
I have had cause to use this functionality many times. Any chance you might add it to CONSOLIDATOR, John?

18 Sep 2008 Andres T.

Fortunately Loren's blog on accumarray links to here (as 'derivative work')! It's great the author took the time to publish pre-accumarray-versions, too. Thank you!

02 Jul 2008 w s

Great and fast tool that I often use. The only thing I miss is the ability to apply different tolerances to different columns of x. That would be great.

21 Apr 2008 chen li

The following is what I use to consolidate two lists and, at the same time, remove outliers in the YList. However, it calls consolidator three times.
Does anyone have a better idea?
**********************************
[xg, meany, Ind] = consolidator(xlist, ylist, 'mean');
[xg, stdy, Ind] = consolidator(xlist,ylist,'std');

notoutlier = find(abs(ylist-meany(Ind)) < 3*stdy(Ind));
xlist = xlist(notoutlier);
ylist = ylist(notoutlier);
[xg, yg, Ind] = consolidator(xlist,ylist);

18 Oct 2007 Ronald Clinton

Great and fast tool I've been using for a while. But as for "2007-09-08 Provided count information as a 4th output", the changed version seems not to be uploaded (18 October 2007)

07 Sep 2007 Sergei Koulayev

It would be nice if the program would report how many elements fall into each cluster...

09 May 2007 Lai Mun Woo

I've found this enormously handy to use. Excellent quick fix routine. Thank you for making it available.

21 Oct 2006 gabriel asaftei  
18 Jan 2006 John D'Errico

A.L. - I've uploaded a new release of consolidator, fixing several other minor problems too as noted in the change history. When Matlab Central recognizes the new release in a few hours, please verify that consolidator13 now runs properly, as I cannot test it below R14. Thank you for identifying the problem. I'm sorry about the inconvenience.

18 Jan 2006 A. L.

Possible fix for previous comment (limited testing):

Replace line 201:

count=accumarray(eb,1).';

with:

count = diff(find([iu; true])).';
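The two expressions can be compared on a small sorted vector. In this sketch, iu is assumed to flag the first row of each run of equal values and eb is assumed to be the running group number cumsum(iu); those definitions are guesses about the surrounding code, which is not shown here.

```matlab
xs = sort([3; 1; 2; 3; 1; 3]);        % sorted data: [1;1;2;3;3;3]
iu = [true; diff(xs) ~= 0];           % true at the first row of each run (assumed meaning)
eb = cumsum(iu);                      % group number of every row (assumed meaning)

countA = accumarray(eb, 1).';         % counts via accumarray
countB = diff(find([iu; true])).';    % counts via distances between run starts
```

Both give one count per run, because appending true marks the position just past the last row, so the differences between consecutive run starts are the run lengths.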

18 Jan 2006 A. L.

The R13 version uses accumarray, which I don't think was available until R14 (I may be wrong), which is rather disappointing if you wanted to use consolidator to add accumarray functionality to an older release.

03 Jan 2006 Iram Weinstein

This is a really useful function. However, when the aggregation option is 'count', I find that Duane Hanselmann's mmrepeat is much faster

23 Nov 2005 Robert Halter

How is this different from accumarray?

15 Nov 2005 Liang Jin

This is exactly what I am looking for!
The hist() in MATLAB is too limited in functionality.

14 Nov 2005 Michael Ebstyne

Much needed addition to MATLAB functionality! For those coming from the SQL world, used to doing massive aggregations and wildly complex rolling of data sets in simple SQL statements, you've probably been looking for this. One suggestion... it would be killer to tackle multiple aggregate types across multiple columns.

30 Oct 2005 Evan Weller

I agree with Urs. Would be an excellent inclusion into future releases of Matlab.

Exactly what I needed for my work.

31 Aug 2005 urs (us) schwarz

wow, what an (almost) flawlessly coded snippet of long-awaited code! it's too bad, however, that there are two minuscule issues with it:

- the help section is TOO wordy (almost a novel by itself) and MUST be streamlined to the very essential, bare bone

- the name CONSOLIDATOR is distracting (and to most people rather obfuscating) and (really!) should be changed to ACCUMARRAYN, which is what it really does: extend the functionality of this otherwise great addition to the ML family of pre-packaged functions (just consider how easily it preprocesses data for the statistics tbx's family of ANOVAs!)

altogether, this code is so essential one might even ask the dear people at TMW to include it (maybe even in mexed form) in one of the future releases
us

Updates
30 Aug 2005

The newer code has been sped up, plus several minor bugs are fixed.

Many thanks are due to Urs Schwarz for his aid in debugging drafts of my code and suggesting alternatives in the code as well as the interface.

11 Oct 2005

It now works on (rectangular) character arrays.

23 Nov 2005

Documentation change

18 Jan 2006

1. Replaced use of accumarray for consolidator13.
2. Replaced a round with ceil to improve the clustering behavior of consolidator near the endpoints.
3. Allowed the user to supply row vectors.
4. Fixed a bug that caused failure when x is a scalar.

02 May 2006

1. Comments about converting complex x to its real and imaginary parts.
2. Fix bug with tolerance when reps are within the tolerance level at the minimum element.
3. Added name, e-mail, etc. to the code.
