Jorsorokin/HDBSCAN

version 1.0.0.0 (28.2 KB) by Jordan Sorokin
HDBSCAN - hierarchical density-based clustering for applications with noise

16 Downloads

Updated 30 Jun 2018

This is a MATLAB implementation of HDBSCAN, a hierarchical version of DBSCAN. HDBSCAN is described in Campello et al. 2013 and Campello et al. 2015. Please see the extensive documentation in the GitHub repository. Suggestions for improvement and collaborations are encouraged!

Comments and Ratings (14)

With MATLAB R2017b I get 'Undefined function or variable 'hdbscan_fit'.'

ha ha

How to use your code? Please help me

OK, everything should be updated now. Sorry about the trouble; too many local copies of the repository have made file management more difficult than anticipated. Let me know if you run into issues. Also, just FYI, the compute_nearest_neighbors implementation is rather slow.

DraDri

Hi again, Thanks for putting this together and working to fix it. Unfortunately another error with a missing function:

Undefined function or variable 'find_blockIndex_range'.

Error in compute_nearest_neighbors (line 22)
[start,stop] = find_blockIndex_range( m,n,1e8 );

Error in compute_core_distances (line 38)
[~,dCore,D] = compute_nearest_neighbors( X,X,k-1 ); % slow but memory conservative

Error in hdbscan_fit (line 86)
[dCore,D] = compute_core_distances( X,minpts );

Error in HDBSCAN/fit_model (line 195)
self.model = hdbscan_fit( self.data,...

Hi, thank you for the comments. I pushed an update but forgot to include all of the necessary files. The repository should now be updated correctly.

DraDri

I'm having the same error as AMB, but on Windows with MATLAB R2018a.

AMB

Hi

Looking further into the function "compute_core_distances"

in the code

if n > 15e3
[~,dCore,D] = compute_nearest_neighbors( X,X,k-1 ); % slow but memory conservative

the call to "compute_nearest_neighbors" will not work because "compute_nearest_neighbors" is not in the repository

and the call which gives rise to the error in my earlier post:

D = sparse( reshape( double( neighbors' ),n*k,1),...
reshape( repmat( 1:m,k,1 ),n*k,1 ),...
reshape( double( distances ),n*k,1 ),...
n,n );

the variable "distances" is not defined.
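For reference, here is a minimal sketch of how a sparse nearest-neighbor distance matrix can be assembled from an n-by-k index array and a matching n-by-k distance array. It is not the repository's code: the toy data, the brute-force neighbor search, and the variable names `rows`, `cols`, and `vals` are assumptions made for illustration. The point it shows is that all three vectors passed to `sparse` must have exactly n*k elements, which is the condition `reshape` enforces.

```matlab
% Illustrative sketch only; not the compute_core_distances implementation.
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];   % toy data: 6 points in 2-D
n = size( X,1 );
k = 2;                                 % neighbors per point

% Pairwise Euclidean distances (implicit expansion needs R2016b or later)
sq = sum( X.^2,2 );
dist = sqrt( max( sq + sq' - 2*(X*X'),0 ) );
dist(1:n+1:end) = Inf;                 % exclude each point from its own neighbor list

% Brute-force k nearest neighbors: both outputs are n x k
[sortedDist,idx] = sort( dist,2 );
neighbors = idx(:,1:k);
distances = sortedDist(:,1:k);

% Flatten consistently: transpose both arrays so each point's k entries stay together
rows = reshape( repmat( 1:n,k,1 ),n*k,1 );   % 1 1 2 2 3 3 ...
cols = reshape( neighbors',n*k,1 );
vals = reshape( distances',n*k,1 );

% reshape fails with "the number of elements must not change" if any input is not n*k long
assert( numel( rows ) == n*k && numel( cols ) == n*k && numel( vals ) == n*k );

D = sparse( rows,cols,vals,n,n );      % D(i,j) = distance from point i to its neighbor j
```

If `neighbors` or `distances` comes back with a different shape than expected (or is never assigned), the `reshape` calls are where the mismatch surfaces, so checking `size( neighbors )` and `size( distances )` just before them is a quick way to localize the problem.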

AMB

Hi

I am using '9.3.0.713579 (R2017b)' on a MAC

I note that there are 2 HDBSCAN.m files in the File Exchange download and in the GitHub repository.

One of the files is in the "functions" folder and the other is above it. They are different from one another, so I suggest that one of them be removed.

Now, irrespective of which HDBSCAN.m I invoke, I get an error.

As an example, using the following code

X = [randn(100,2)+[-2 -2]; randn(100,2)+[2 2]; randn(100,2)+[0 0]];
clusterer = HDBSCAN( X );
clusterer.run_hdbscan( 10,20,[],0.85 );

I get an error

Error using reshape
To RESHAPE the number of elements must not change.

Error in compute_core_distances (line 49)
D = sparse( reshape( double( neighbors' ),n*k,1),...

Error in hdbscan_fit (line 86)
[dCore,D] = compute_core_distances( X,minpts );

What do you suggest?
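Two of the problems reported in this thread (helper files missing from the download, and two different copies of HDBSCAN.m) can be checked from the MATLAB prompt before running anything. The sketch below uses only the built-in `exist` and `which` functions. The list of names is an assumption: it is simply collected from the error messages quoted in these comments and may not be the complete set the package needs.

```matlab
% Function names taken from the error messages in this thread (possibly incomplete)
required = { 'HDBSCAN','hdbscan_fit','compute_core_distances', ...
             'compute_nearest_neighbors','find_blockIndex_range', ...
             'compute_pairwise_dist','mutual_reachability' };

% exist( name,'file' ) returns 0 when no file of that name is on the path
isMissing = cellfun( @(f) exist( f,'file' ) == 0,required );
missing = required(isMissing);

if isempty( missing )
    disp( 'All listed functions were found on the path.' );
else
    fprintf( 'Not found on the path: %s\n',strjoin( missing,', ' ) );
end

% which( ...,'-all' ) returns every match on the path as a cell array
copies = which( 'HDBSCAN','-all' );
if numel( copies ) > 1
    fprintf( '%d copies of HDBSCAN are on the path:\n',numel( copies ) );
    disp( copies );   % the first entry is normally the one that takes precedence
end
```

If names are reported as missing even though the files exist on disk, the download folder is probably not on the path. Adding it with `addpath` usually resolves that, but adding every subfolder at once would also put both copies of HDBSCAN.m on the path.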

Hi all, thank you for the comments. I originally forgot to include the "compute_pairwise_dist.m" file, which is now uploaded into the github repo. Let me know if you run into other issues.

Also, a trick I've found for best results is to set the "minpts" variable quite low (2-3) and adjust "minclustsize" according to your needs.
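To see why a small "minpts" changes the result, it helps to recall how that parameter enters HDBSCAN as described by Campello et al.: each point gets a core distance (the distance to its minpts-th nearest point), and pairwise distances are replaced by mutual reachability distances. The sketch below works this out on toy data. It is not the code in this repository: the data are made up, and counting the point itself as its own first neighbor is one convention among several, assumed here for illustration.

```matlab
% Illustrative sketch of core distance and mutual reachability; not the repository's code.
X = [0 0; 0 1; 1 0; 1 1; 5 5];   % four tightly grouped points and one distant point
minpts = 2;

% Pairwise Euclidean distances (implicit expansion needs R2016b or later)
sq = sum( X.^2,2 );
d = sqrt( max( sq + sq' - 2*(X*X'),0 ) );

% Core distance: distance to the minpts-th nearest point.
% Assumed convention: the point itself counts as the first neighbor (distance 0).
sortedD = sort( d,2 );
dCore = sortedD(:,minpts);

% Mutual reachability: mr(i,j) = max( dCore(i), dCore(j), d(i,j) )
mr = max( d,max( dCore,dCore' ) );
```

With a low "minpts" the core distances in the dense group stay small, so the mutual reachability distances there are close to the raw distances. A larger "minpts" raises every core distance, which pushes points in sparse regions further away from everything else and makes more of them look like noise.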

David Segev

Just does not work.
I got this error:
Training cluster hierarchy...
Data matrix size:
2000 points x 2 dimensions

Min # neighbors: 5
Min cluster size: 5
Skipping every 0 iteration

Undefined function or variable 'compute_pairwise_dist'.

Error in mutual_reachability (line 19)
mr = compute_pairwise_dist( X );

Error in hdbscan_fit (line 84)
mr = mutual_reachability( X,dCore );

Error in HDBSCAN/fit_model (line 191)
self.model = hdbscan_fit( self.data,...

Error in HDBSCAN/run_hdbscan (line 319)
self.fit_model( dEps );

You're my hero. This is great! Thank you for making this!

serg sh

Undefined function or variable 'compute_pairwise_dist'

Updates

1.0.0.0

Improved performance and memory usage for very large (>15,000 point) data sets. Also added "sparse_to_csr.m", a file by the author of "bfs.m" and "mst_prim.m" for converting sparse matrices to compressed sparse row (CSR) form.

1.0.0.0

Updates to the main algorithm for a massive speedup (5-10x) by switching away from the native MATLAB "graph" class during fitting. Prediction of new points is also faster and more accurate.

1.0.0.0

Added "minClustNum" parameter to the HDBSCAN object, which helps realize child clusters in situations where the algorithm finds a few single large clusters but the user disagrees with the results.

MATLAB Release Compatibility
Created with R2017b
Compatible with any release
Platform Compatibility
Windows, macOS, Linux
Acknowledgements

Inspired by: gaimc : Graph Algorithms In Matlab Code