DO NOT USE 0.85 with OpenMP support; it has bugs; advise upgrade to 0.86.
C++ std::nth_element is an efficient algorithm for selecting a ranked element from a vector of data. Typically it is implemented as a variant of quickselect, AKA Hoare's Selection Algorithm. The mex-file in this package will run nth_element over a 2D array column-wise. See C++ documentation and http://en.wikipedia.org/wiki/Selection_algorithm for more information.
I have added (as of v0.84) the ability to operate on data in-place. This potentially saves an array copy so can be significantly more efficient. I see about another 2x speedup in my tests on some machines (seems largely dependent on chipset/memory bandwidth). Because it uses undocumented mex calls, this may break on future versions of Matlab. Please give it a try and send feedback.
As of version 0.86, you can also get back the indices for the partitioning by requesting a second return value. This is slower than running just the data partition though. Only for nth_element for the moment. Should be easy to implement for fast_median when I find time / get more requests.
I have also added (as of v0.86) OpenMP pragmas to support operating on multiple columns in parallel. See nth_element.m or below for instructions on compiling the mex files with OpenMP support.
One example calculation based on nth_element is also included, a mex-file for fast_median. In my benchmarks, fast_median is roughly twice as fast as MatLab's native median function. MatLab's median relies on sort, but sorting the entire input data to get the median is inefficient. Theoretical average complexity of fast_median is O(n), compared to best case complexity of O(n log n) for a full sort based approach.
Median calculations are particularly important in robust statistics, for example the median absolute deviation (MAD).
To install, unpack the zip, go to the directory from MatLab, and run:
> mex nth_element.cpp
> mex fast_median.cpp
To get parallel processing of independent columns, you need to compile with OpenMP support. See your compiler instructions. With GCC, e.g., you would do:
> mex nth_element.cpp CXXFLAGS="\$CXXFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp"
Then put the resulting binaries on your MatLab path.
Peter Li (2020). nth_element (https://www.mathworks.com/matlabcentral/fileexchange/29453-nth_element), MATLAB Central File Exchange. Retrieved .
Heh, the plot thickens; in R2012a median doesn't take integer inputs at all! Glad they decided to walk that one back, although the return 0 for empty input is a bad bug; I'll report it.
Really, it returns zero now? That sounds like a worse bug than before; in many cases zero can be a real median value. Empty output or throw exception seems like the only correct behavior for empty integer input.
I think rather than add a second input flag, I'll punt and suggest you just wrap fast_median with a check for empty input if you really need it to behave like default median. I don't think this should slow things down significantly, but let me know if it's not workable.
Although, I agree that it makes sense to return empty for empty input for practical purposes it would be useful to e.g. have a flag as second input in order to change the default return value in case of empty input. his is because I have a complex routine that relies on the default behaviour of median.
Also, on R2013a median(zeros(0,'int32')) returns zero (they corrected the bug).
I uploaded a patched version that returns empty arrays for empty inputs. Let me know if you are still interested in NaN for floating point.
Hi Oleg, thanks for catching this.
It's an interesting case actually; MathWorks has a bug in their code, in that they try to return NaN even if the type is not double. Try: median(zeros(0,1), 'int32').
Do you need it to return NaN for some reason? I would prefer to return empty array of the proper type for all cases, but if you really need NaN I could do that for floating point types and empty for ints.
The following crashes MATLAB on Win7 64bit R2013a:
The MATLAB median on the other hand returns NaN:
Can you please adapt your submission to handle empty arrays.
In addition to the compile-time bug mentioned by Yannis, 0.85 also has multithreading OpenMP bugs. Strongly advise everyone to upgrade to 0.86 (should be posted soon). DO NOT USE 0.85 with OpenMP turned on or you will risk race conditions.
@Yannis, yes sorry that looks like a bug. I forgot that I had hacked in the requested feature to have it optionally return the sorted indices as well as the sorted data. Some cobwebs in here... I'll fix it by tomorrow and upload a new version.
Hi, thanks for implementing this, I really need it. However, I cannot compile it in Matlab 2012b (compiler Microsoft Visual C++ 2010).
The errors mostly appear in line 101 of the _lib
nth_element_lib.cpp(101) : error C2466: cannot allocate an array of constant size 0
nth_element_lib.cpp(101) : error C2133: 'temp' : unknown size
nth_element_lib.cpp(101) : error C2057: expected constant expression nth_element_lib.cpp(131) : see reference to function template instantiation 'void nth_element_cols<float>(T *,unsigned int *,mwIndex,mwSize,mwSize)' being compiled with [ T=float ]
Well, this reply is very late, but Matthias I don't quite understand your issue. fast_median does work on the first dimension by default and seems to do the right thing with your example: med = fast_median(data(:));
The thing fast_median does not do is operate on row vectors, which is a common Matlab behavior I have not chosen to support; is that what you meant to ask about?
As a follow up on WurmD's comment: "all" Matlab functions work on the first dimension by default so I think a "replacement" method should do so too.
The current behavior of fast_median breaks things like:
med = median(data(:));
Currently I can't use fast_median in my code as I need to be able to change the filtering function and I don't want to code in some special treatment for fast_median... :/
Would that be possible?
Hi jianbo, I think this is the same as Martijn's request? I may have time to give this a try in the next month or two. Unfortunately I expect it to be significantly slower than just returning the partially-sorted output. This is a limitation of the way the C++ standard library is written and I'm not sure I can get around it. Bruno's version may have more flexibility to be efficient here.
Hi, I am interested to have index, rather of the n-th element output, thx.
Hi LI, this uses a very similar algorithm to built-in sort, except that if you are only trying to know the "nth highest element" it is wasteful to sort the entire data. So it is faster than sort for finding out the "nth highest element" as long as the data are fairly large (for shorter data it performs about equally to sort).
Of course, if you are trying to find many different "nth highest elements" out of the same dataset you might be better off sorting the whole thing. One way to improve on the performance in that case would be to use the output array from nth_element as the input to the next call to nth_element; the array will become more fully sorted each iteration so the performance will get better. If you use the latest version (about to upload 0.84) you can use the inplace version of nth_element to make this very efficient.
Another caveat is that if you are trying to get, for example, the highest or lowest value then of course a simple max() or min() is better. This holds true if you are trying to get, for example, only the top three values out of a large dataset too.
It's interesting. I am wondering if it is faster than built-in function sort in MATLAB in some sceanrios.
@Martijn: Good question; I believe I could implement this but it would require a significant rewrite. I doubt I'll have time in the short term.
If someone wants to take a stab at it, download my source and then change the call to std::nth_element so that instead of sorting a simple vector of numbers you are sorting a vector of number pairs where the first number of each pair is the value and the second number is the original index. You will also have to change the call to std:nth_element to the version that takes a Compare argument and write a custom Compare to simply take the pair and only use the first entry for sorting.
In the meantime, I recommend you look into Bruno Luong's implementation. He wrote the algorithm by hand in C, so probably has more flexibility to include features like this; it may already be included.
Nice work Peter. Is there a way to get out the (old) indices of the N ranked items?
Thanks for the suggestion. I'm currently sitting on some other suggested updates too so I will try to look through things and release an update next week.
Thank you for your work Peter,
I would suggest fast_median to receive the same arguments as median, namely "M = median(A,dim)"
for it to be trivially swappable with median, and ensure its eventual inclusion into standard MatLab.
Fixed version 0.83 has been uploaded, may take some time to post. I'll PM it to you to test. Thanks!
Yes, i am running them from that directory with all *.h and *.cpp.
Microsoft Visual C++ 2008 SP1
is my compiler
and btw, I'm using win 7 64-bit with matlab r2010b 64-bit.
Hmm, are you running these from the directory with the nth_element.h file? Can you tell me what compiler you're using?
both of them return errors:
nth_element_0.81\nth_element.h(79) : error C2561: 'run_nth_element' : function must return a value
Note Bruno Luong's very similar post here: http://www.mathworks.com/matlabcentral/fileexchange/23576-minmax-selection
His version is written in C. He wrote the partial quicksort algorithm himself and it looks like very nice work. I find that relying on the C++ standard library gives slightly faster performance, but your mileage may vary depending on your compiler.
Find the treasures in MATLAB Central and discover how the community can help you!Start Hunting!