NUK - logo
E-viri
Recenzirano Odprti dostop
  • A fast clustering algorithm...
    Chen, Yewang; Tang, Shengyu; Bouguila, Nizar; Wang, Cheng; Du, Jixiang; Li, HaiLin

    Pattern recognition, November 2018, 2018-11-00, Letnik: 83
    Journal Article

    •The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.•NQ-DBSCAN is an exact algorithm that may return the same result as DBSCAN if the parameters are same. While ρ-Approximate DBSCAN is an approximate algorithm.•The best complexity of NQ-DBSCAN can be O(n), and the average complexity of NQ-DBSCAN is proved to be O(n log(n)) provided the parameters are properly chosen. While ρ-Approximate DBSCAN runs only in O(n2) in high dimension.•NQ-DBSCAN is suitable for clustering data with a lot of noise. Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n2) algorithm in very high dimension such that 2D >  > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.