Code covered by the BSD License  

Highlights from
nth_element

4.75

4.8 | 4 ratings Rate this file 21 Downloads (last 30 days) File Size: 10.4 KB File ID: #29453

nth_element

by

 

19 Nov 2010 (Updated )

MEX wrap of C++ nth_element. Plus fast_median, a faster median function. In-place and parallel.

| Watch this File

File Information
Description

DO NOT USE 0.85 with OpenMP support; it has bugs; advise upgrade to 0.86.

C++ std::nth_element is an efficient algorithm for selecting a ranked element from a vector of data. Typically it is implemented as a variant of quickselect, AKA Hoare's Selection Algorithm. The mex-file in this package will run nth_element over a 2D array column-wise. See C++ documentation and http://en.wikipedia.org/wiki/Selection_algorithm for more information.

I have added (as of v0.84) the ability to operate on data in-place. This potentially saves an array copy so can be significantly more efficient. I see about another 2x speedup in my tests on some machines (seems largely dependent on chipset/memory bandwidth). Because it uses undocumented mex calls, this may break on future versions of Matlab. Please give it a try and send feedback.

As of version 0.86, you can also get back the indices for the partitioning by requesting a second return value. This is slower than running just the data partition though. Only for nth_element for the moment. Should be easy to implement for fast_median when I find time / get more requests.

I have also added (as of v0.86) OpenMP pragmas to support operating on multiple columns in parallel. See nth_element.m or below for instructions on compiling the mex files with OpenMP support.

One example calculation based on nth_element is also included, a mex-file for fast_median. In my benchmarks, fast_median is roughly twice as fast as MatLab's native median function. MatLab's median relies on sort, but sorting the entire input data to get the median is inefficient. Theoretical average complexity of fast_median is O(n), compared to best case complexity of O(n log n) for a full sort based approach.

Median calculations are particularly important in robust statistics, for example the median absolute deviation (MAD).

To install, unpack the zip, go to the directory from MatLab, and run:
  > mex nth_element.cpp
  > mex fast_median.cpp

To get parallel processing of independent columns, you need to compile with OpenMP support. See your compiler instructions. With GCC, e.g., you would do:
  > mex nth_element.cpp CXXFLAGS="\$CXXFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"

Then put the resulting binaries on your MatLab path.

MATLAB release MATLAB 7.10 (R2010a)
Other requirements MEX setup with a C++ compiler. OpenMP supporting compiler for parallel column processing.
Tags for This File   Please login to tag files.
Please login to add a comment or rating.
Comments and Ratings (26)
17 Nov 2013 Peter Li

Heh, the plot thickens; in R2012a median doesn't take integer inputs at all! Glad they decided to walk that one back, although the return 0 for empty input is a bad bug; I'll report it.

16 Nov 2013 Peter Li

Really, it returns zero now? That sounds like a worse bug than before; in many cases zero can be a real median value. Empty output or throw exception seems like the only correct behavior for empty integer input.

I think rather than add a second input flag, I'll punt and suggest you just wrap fast_median with a check for empty input if you really need it to behave like default median. I don't think this should slow things down significantly, but let me know if it's not workable.

16 Nov 2013 Oleg Komarov

Although, I agree that it makes sense to return empty for empty input for practical purposes it would be useful to e.g. have a flag as second input in order to change the default return value in case of empty input. his is because I have a complex routine that relies on the default behaviour of median.

Also, on R2013a median(zeros(0,'int32')) returns zero (they corrected the bug).

14 Nov 2013 Peter Li

I uploaded a patched version that returns empty arrays for empty inputs. Let me know if you are still interested in NaN for floating point.

14 Nov 2013 Peter Li

Hi Oleg, thanks for catching this.

It's an interesting case actually; MathWorks has a bug in their code, in that they try to return NaN even if the type is not double. Try: median(zeros(0,1), 'int32').

Do you need it to return NaN for some reason? I would prefer to return empty array of the proper type for all cases, but if you really need NaN I could do that for floating point types and empty for ints.

14 Nov 2013 Oleg Komarov

The following crashes MATLAB on Win7 64bit R2013a:

fast_median(zeros(0,1))

The MATLAB median on the other hand returns NaN:

median(zeros(0,1))
ans =
NaN

Can you please adapt your submission to handle empty arrays.

26 Jul 2013 Peter Li

In addition to the compile-time bug mentioned by Yannis, 0.85 also has multithreading OpenMP bugs. Strongly advise everyone to upgrade to 0.86 (should be posted soon). DO NOT USE 0.85 with OpenMP turned on or you will risk race conditions.

26 Jul 2013 Peter Li

@Yannis, yes sorry that looks like a bug. I forgot that I had hacked in the requested feature to have it optionally return the sorted indices as well as the sorted data. Some cobwebs in here... I'll fix it by tomorrow and upload a new version.

26 Jul 2013 Yannis

Hi, thanks for implementing this, I really need it. However, I cannot compile it in Matlab 2012b (compiler Microsoft Visual C++ 2010).

The errors mostly appear in line 101 of the _lib

nth_element_lib.cpp(101) : error C2466: cannot allocate an array of constant size 0
nth_element_lib.cpp(101) : error C2133: 'temp' : unknown size
nth_element_lib.cpp(101) : error C2057: expected constant expression nth_element_lib.cpp(131) : see reference to function template instantiation 'void nth_element_cols<float>(T *,unsigned int *,mwIndex,mwSize,mwSize)' being compiled with [ T=float ]

26 Jul 2013 Yannis  
26 Jun 2013 Peter Li

Well, this reply is very late, but Matthias I don't quite understand your issue. fast_median does work on the first dimension by default and seems to do the right thing with your example: med = fast_median(data(:));

The thing fast_median does not do is operate on row vectors, which is a common Matlab behavior I have not chosen to support; is that what you meant to ask about?

15 Jun 2013 Jaehwan  
27 Feb 2012 Matthias Klemm

As a follow up on WurmD's comment: "all" Matlab functions work on the first dimension by default so I think a "replacement" method should do so too.
The current behavior of fast_median breaks things like:
med = median(data(:));

Currently I can't use fast_median in my code as I need to be able to change the filtering function and I don't want to code in some special treatment for fast_median... :/

Would that be possible?

07 Dec 2011 Peter Li

Hi jianbo, I think this is the same as Martijn's request? I may have time to give this a try in the next month or two. Unfortunately I expect it to be significantly slower than just returning the partially-sorted output. This is a limitation of the way the C++ standard library is written and I'm not sure I can get around it. Bruno's version may have more flexibility to be efficient here.

05 Dec 2011 jianbo

Hi, I am interested to have index, rather of the n-th element output, thx.

25 Aug 2011 Peter Li

Hi LI, this uses a very similar algorithm to built-in sort, except that if you are only trying to know the "nth highest element" it is wasteful to sort the entire data. So it is faster than sort for finding out the "nth highest element" as long as the data are fairly large (for shorter data it performs about equally to sort).

Of course, if you are trying to find many different "nth highest elements" out of the same dataset you might be better off sorting the whole thing. One way to improve on the performance in that case would be to use the output array from nth_element as the input to the next call to nth_element; the array will become more fully sorted each iteration so the performance will get better. If you use the latest version (about to upload 0.84) you can use the inplace version of nth_element to make this very efficient.

Another caveat is that if you are trying to get, for example, the highest or lowest value then of course a simple max() or min() is better. This holds true if you are trying to get, for example, only the top three values out of a large dataset too.

19 Jun 2011 LI

It's interesting. I am wondering if it is faster than built-in function sort in MATLAB in some sceanrios.

12 Jun 2011 Peter Li

@Martijn: Good question; I believe I could implement this but it would require a significant rewrite. I doubt I'll have time in the short term.

If someone wants to take a stab at it, download my source and then change the call to std::nth_element so that instead of sorting a simple vector of numbers you are sorting a vector of number pairs where the first number of each pair is the value and the second number is the original index. You will also have to change the call to std:nth_element to the version that takes a Compare argument and write a custom Compare to simply take the pair and only use the first entry for sorting.

In the meantime, I recommend you look into Bruno Luong's implementation. He wrote the algorithm by hand in C, so probably has more flexibility to include features like this; it may already be included.

12 Jun 2011 Martijn Steenwijk

Nice work Peter. Is there a way to get out the (old) indices of the N ranked items?

19 May 2011 Peter Li

Thanks for the suggestion. I'm currently sitting on some other suggested updates too so I will try to look through things and release an update next week.

19 May 2011 WurmD

Thank you for your work Peter,

I would suggest fast_median to receive the same arguments as median, namely "M = median(A,dim)"
for it to be trivially swappable with median, and ensure its eventual inclusion into standard MatLab.

Cheers,

22 Feb 2011 Peter Li

Fixed version 0.83 has been uploaded, may take some time to post. I'll PM it to you to test. Thanks!

18 Feb 2011 Dun Kirk

Yes, i am running them from that directory with all *.h and *.cpp.
Microsoft Visual C++ 2008 SP1
is my compiler
and btw, I'm using win 7 64-bit with matlab r2010b 64-bit.

17 Feb 2011 Peter Li

Hmm, are you running these from the directory with the nth_element.h file? Can you tell me what compiler you're using?

17 Feb 2011 Dun Kirk

mex fast_median.cpp
mex nth_element.cpp

both of them return errors:
nth_element_0.81\nth_element.h(79) : error C2561: 'run_nth_element' : function must return a value

08 Dec 2010 Peter Li

Note Bruno Luong's very similar post here: http://www.mathworks.com/matlabcentral/fileexchange/23576-minmax-selection

His version is written in C. He wrote the partial quicksort algorithm himself and it looks like very nice work. I find that relying on the C++ standard library gives slightly faster performance, but your mileage may vary depending on your compiler.

Updates
19 Nov 2010

Just updating licenses to match FreeBSD required by MatLab central. Also added a little documentation to source files.

22 Feb 2011

Fixed void pointer return to void so that fancier C++ compilers will not complain about missing return type.

25 Aug 2011

Added in-place editing.

27 Jun 2013

Added OpenMP pragmas to support operating on multiple columns in parallel. Tried to clean up includes without complicating mex build process.

27 Jun 2013

Corrected OpenMP compile help in nth_element.m and added similar info in description here.

26 Jul 2013

Bug fixes

15 Nov 2013

Bugfix for 0xN empty inputs.

Contact us