We analyze the drawbacks of DBSCAN and its variants, and find that the grid technique used in Fast-DBSCAN and ρ-approximate DBSCAN is almost useless in high-dimensional data spaces, because it usually incurs a large number of redundant distance computations. To tame these problems, two techniques are proposed. The first uses an ϵ/2-norm ball to identify Inner Core Blocks, within which all points are core points; it is more efficient than the grid technique because it finds more core points at a time. The second is a fast approximate algorithm for judging whether two Inner Core Blocks are density-reachable from each other. In addition, a cover tree is used to accelerate the density computations. Based on these three techniques, an approximate approach named BLOCK-DBSCAN is proposed for large-scale data; it runs in about O(n log n) expected time and obtains almost the same result as DBSCAN. BLOCK-DBSCAN has two versions: the L2 version works well for relatively high-dimensional data, while the L∞ version is suitable for high-dimensional data. Experimental results show that BLOCK-DBSCAN is promising and outperforms NQ-DBSCAN, ρ-approximate DBSCAN, and AnyDBC.
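The ϵ/2-ball observation behind Inner Core Blocks can be sketched as follows (a minimal NumPy illustration, not the paper's code; the function name and data are made up): if at least MinPts points lie within ϵ/2 of some point, any two of them are within ϵ of each other, so all of them are certified as core points in one range query.

```python
import numpy as np

def inner_core_block(points, p_idx, eps, min_pts):
    # If >= min_pts points lie within eps/2 of points[p_idx], any two of
    # them are within eps of each other (triangle inequality), so every
    # point in the ball is a core point -- no per-point range query needed.
    d = np.linalg.norm(points - points[p_idx], axis=1)
    ball = np.flatnonzero(d <= eps / 2)
    return ball if len(ball) >= min_pts else None

rng = np.random.default_rng(0)
blob = rng.normal(0.0, 0.05, size=(50, 2))   # one dense blob
block = inner_core_block(blob, 0, eps=0.5, min_pts=5)
```

With a dense blob, the whole ϵ/2-ball is returned at once; a sparse region would return `None` and fall back to ordinary point-wise checks.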
•The underlying idea is that point p and point q should have similar neighbors provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.
•NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-approximate DBSCAN is an approximate algorithm.
•The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n log n) provided the parameters are properly chosen, whereas ρ-approximate DBSCAN runs in O(n²) in high dimensions.
•NQ-DBSCAN is suitable for clustering data with a lot of noise.
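The "similar neighbors" idea rests on the triangle inequality; a minimal sketch (names and data are illustrative, not taken from NQ-DBSCAN):

```python
import numpy as np

def neighbour_bounds(points, p, q, eps):
    # d(p, x) <= d(p, q) + d(q, x): points within eps - d(p,q) of q are
    # certainly eps-neighbours of p, and only points within eps + d(p,q)
    # of q can possibly be -- everything else is pruned without a
    # distance computation against p.
    d_pq = np.linalg.norm(points[p] - points[q])
    d_q = np.linalg.norm(points - points[q], axis=1)
    certain = np.flatnonzero(d_q <= eps - d_pq)
    candidates = np.flatnonzero(d_q <= eps + d_pq)
    return certain, candidates

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.3, 0.0], [2.0, 0.0]])
certain, candidates = neighbour_bounds(pts, 0, 1, eps=0.5)
```

The closer p and q are, the tighter the two bounds, and the fewer explicit distance checks remain.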
Clustering is an important technique for dealing with the large-scale data being created explosively on the Internet. Most of these data are high-dimensional and noisy, which brings great challenges to retrieval, classification, and understanding. No currently existing approach is "optimal" for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-approximate DBSCAN runs in O(n) expected time only when the dimension D is a relatively small constant. However, we prove theoretically and experimentally that ρ-approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN, yielding NQ-DBSCAN, such that a large number of unnecessary distance computations are effectively avoided. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many real-time data.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies high-density connected areas as clusters, so it has advantages in discovering arbitrary-shaped clusters. However, it has difficulty in adjusting parameters, and since it needs to scan all data points in turn, its time complexity is O(n²). A granular-ball (GB) is a coarse-grained representation of data. It is based on the assumption that an object and its local neighbors have a similar distribution and are highly likely to belong to the same class. It was introduced into supervised learning by Xia et al. to improve its efficiency. Inspired by the idea of the granular-ball, we introduce it into unsupervised learning and use it to improve the efficiency of DBSCAN, yielding GB-DBSCAN. The main idea of the proposed algorithm GB-DBSCAN is to employ granular-balls to represent sets of data points and then to cluster the granular-balls instead of the data points. Firstly, we use k-nearest neighbors (KNN) to generate granular-balls, which is a bottom-up strategy, and describe granular-balls by their centers and radii. Then, the granular-balls are divided into Core-GBs and Non-Core-GBs according to their density. After that, the Core-GBs are merged into clusters following the idea of DBSCAN, and the Non-Core-GBs are assigned to the appropriate clusters. Since the number of granular-balls is much smaller than the number of objects in a dataset, the running time of DBSCAN is greatly reduced. Compared with the KNN-BLOCK DBSCAN, RNN-DBSCAN, DBSCAN, K-means, DP, and SNN-DPC algorithms, the proposed algorithm obtains similar or even better clustering results in much less running time.
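The bottom-up granular-ball generation can be sketched roughly as follows (a simplified illustration under assumed details, not the paper's exact KNN procedure; all names and data are made up):

```python
import numpy as np

def granular_balls(points, k):
    # Greedy bottom-up sketch: take an unassigned point plus its k nearest
    # unassigned neighbours as one ball, then describe the ball only by its
    # centre and radius -- clustering later operates on these summaries.
    unassigned = set(range(len(points)))
    balls = []
    while unassigned:
        i = next(iter(unassigned))
        rest = np.array(sorted(unassigned))
        d = np.linalg.norm(points[rest] - points[i], axis=1)
        members = rest[np.argsort(d)[:k + 1]]   # i itself plus k neighbours
        centre = points[members].mean(axis=0)
        radius = np.linalg.norm(points[members] - centre, axis=1).max()
        balls.append((centre, radius, members))
        unassigned -= set(members.tolist())
    return balls

rng = np.random.default_rng(1)
data = rng.normal(size=(40, 2))
balls = granular_balls(data, k=4)
```

With 40 points and k = 4 this yields 8 balls, so any subsequent DBSCAN-style pass touches 8 summaries instead of 40 points.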
The fuzzy-C-means (FCM) algorithm is one of the most famous fuzzy clustering algorithms, but it gets stuck in local optima. In addition, this algorithm requires the number of clusters in advance. The density-based spatial clustering of applications with noise (DBSCAN) algorithm, a density-based clustering algorithm, does not, unlike FCM, require the number of clusters to be specified beforehand; when the clusters are well defined, it can determine their number itself. Another advantage of the DBSCAN clustering algorithm over FCM is its ability to cluster data of different shapes. In this paper, in order to overcome these limitations, a hybrid approach for clustering is proposed which uses the FCM and DBSCAN algorithms. In this method, the optimal number of clusters and the optimal locations of the cluster centers are determined, in three phases, based on changes that take place according to the dataset, by anticipating the problems stated for the FCM algorithm. With this improvement, none of the initial parameters of the FCM algorithm are random: in the first phase, these random values are replaced with optimal ones, which has a significant effect on the convergence of the algorithm because it reduces the number of iterations. The proposed method has been examined on the Iris flower dataset and its results compared with the basic FCM algorithm and another algorithm. The results show the better performance of the proposed method.
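The FCM update that such seeding accelerates looks like this (a standard single FCM iteration in NumPy; the seeding scenario and all names are illustrative, not the paper's code):

```python
import numpy as np

def fcm_step(X, centres, m=2.0):
    # One fuzzy-C-means iteration: update memberships u from the current
    # centres (u_ik proportional to d_ik^(-2/(m-1))), then recompute the
    # centres as membership-weighted means.
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)      # rows sum to 1
    um = u ** m
    centres = (um.T @ X) / um.sum(axis=0)[:, None]
    return u, centres

# Seeding the centres near the true groups (as a density-based first
# phase would) makes a single step land almost on the cluster means.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
u, centres = fcm_step(X, np.array([[0.0, 0.0], [5.0, 0.0]]))
```

Starting from random centres, many such iterations are typically needed, which is exactly the cost the non-random initialisation is meant to cut.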
Spatio-temporal data mining has attracted great attention due to the high availability of spatio-temporal data from varied sources in diverse fields. Huge amounts of spatio-temporal data are being generated through many tracking devices. In epidemiology, diseases and their associated patterns and trends can be explored by taking advantage of methods such as spatio-temporal clustering to discover new knowledge. In this paper, Spatio-Temporal Density Based Spatial Clustering of Applications with Noise (ST-DBSCAN) is implemented and analysed on a public health dataset. Upon implementation, the results are analysed, loopholes are spotted, and a fuzzy version of ST-DBSCAN is proposed. The method is successfully applied to find spatio-temporal clusters in Chicago West Nile Virus (WNV) surveillance data for the period 2007 to 2017. The drawbacks of the original ST-DBSCAN are identified and solutions are proposed. ST-DBSCAN is an extension of the original Density Based Spatial Clustering of Applications with Noise (DBSCAN).
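The two-radius neighborhood that distinguishes ST-DBSCAN from plain DBSCAN can be sketched as follows (an illustrative NumPy fragment with made-up names and data, following the original ST-DBSCAN formulation of separate spatial and temporal radii):

```python
import numpy as np

def st_neighbours(coords, times, i, eps_spatial, eps_temporal):
    # A point j is a neighbour of i only if it is close in BOTH space
    # and time -- two thresholds instead of DBSCAN's single eps.
    ds = np.linalg.norm(coords - coords[i], axis=1)
    dt = np.abs(times - times[i])
    return np.flatnonzero((ds <= eps_spatial) & (dt <= eps_temporal))

coords = np.array([[0.0, 0.0], [0.1, 0.1], [0.1, 0.1], [5.0, 5.0]])
times = np.array([0.0, 1.0, 30.0, 1.0])
nbrs = st_neighbours(coords, times, 0, eps_spatial=0.5, eps_temporal=7.0)
```

Point 2 is spatially identical to point 1 but 30 time units away, so it is excluded; point 3 is temporally close but spatially far, so it is excluded too.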
KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data Chen, Yewang; Zhou, Lida; Pei, Songwen ...
IEEE Transactions on Systems, Man, and Cybernetics: Systems, June 2021, Volume 51, Issue 6
Journal Article
Peer reviewed
Large-scale data clustering is an essential key to the big data problem. However, no currently existing approach is "optimal" for big data due to high complexity, so it remains a great challenge. In this article, a simple but fast approximate DBSCAN, namely KNN-BLOCK DBSCAN, is proposed based on two findings: 1) the problem of identifying whether a point is a core point or not is, in fact, a kNN problem; and 2) a point has a similar density distribution to its neighbors, and neighboring points are highly likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely FLANN, to detect core-blocks (CBs), noncore-blocks, and noise-blocks within which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to the proper clusters is also invented to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.
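Finding 1) — that the core-point test is a kNN query — can be illustrated as follows (a brute-force sketch, not the FLANN-based implementation; names and data are made up):

```python
import numpy as np

def is_core_via_knn(points, i, eps, min_pts):
    # DBSCAN's test |N_eps(i)| >= min_pts (counting i itself) holds exactly
    # when the min_pts-th smallest distance from i is at most eps, i.e. it
    # is a kNN query with k = min_pts.
    d = np.sort(np.linalg.norm(points - points[i], axis=1))
    return d[min_pts - 1] <= eps

pts = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
dense_is_core = is_core_via_knn(pts, 1, eps=0.15, min_pts=3)
outlier_is_core = is_core_via_knn(pts, 4, eps=0.15, min_pts=3)
```

Replacing the exact `np.sort` query with an approximate kNN index is what lets the blockwise algorithm scale.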
At SIGMOD 2015, an article was presented with the title "DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation" that won the conference's best paper award. In this technical correspondence, we want to point out some inaccuracies in the way DBSCAN was represented, and why the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees and not at an algorithm that can use such indexes. We also discuss the relationship between DBSCAN performance and the indexability of the dataset, and discuss some heuristics for choosing appropriate DBSCAN parameters. Some indicators of bad parameters are proposed to help guide future users of this algorithm in choosing parameters so as to obtain both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen, and thus they are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared with the method proposed by Gan and Tao.
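One well-known heuristic for choosing eps, of the kind discussed here, is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the knee of the curve. A brute-force sketch (illustrative only; names and data are made up):

```python
import numpy as np

def k_distances(points, k):
    # Sorted k-th nearest neighbour distance of every point; in a
    # k-distance plot the knee of this curve suggests eps, commonly
    # paired with minPts = k + 1.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    kth = np.sort(d, axis=1)[:, k]   # column 0 is the point itself
    return np.sort(kth)

rng = np.random.default_rng(2)
data = rng.normal(0.0, 0.1, size=(50, 2))
kd = k_distances(data, k=4)
```

A long flat region followed by a sharp rise marks the transition from cluster interiors to sparse points; eps chosen just before the rise tends to give both meaningful results and index-friendly query sizes.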
•Propose the K-DBSCAN clustering method, which can obtain K clusters of arbitrary shapes.
•A novel harmony search is presented to optimize the clustering parameters.
•Apply the novel harmony search to DBSCAN to realize K-DBSCAN clustering.
At present, the DBSCAN clustering algorithm is commonly used principally because of its ability to discover clusters with arbitrary shapes. When the cluster number K is predefined, the partitional clustering methods can perform efficiently, but they cannot handle non-convex clusters and easily fall into local optima. Therefore, the concept of K-DBSCAN clustering is proposed in this paper. However, basic DBSCAN has a crucial defect: it is difficult to choose suitable clustering parameters. Here, the well-known harmony search (HS) optimization algorithm is used to deal with this problem. By modifying the original HS, a novel harmony search (novel-HS) is put forward, which improves the accuracy of the results as well as the robustness of the optimization. In K-DBSCAN, the novel-HS is used to optimize the clustering parameters of DBSCAN so as to obtain a better clustering result with exactly K clusters. Experimental results show that the designed clustering method has superior performance to others and can be considered as a new clustering scheme for further research.
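The role of the parameter optimizer can be sketched with a plain grid search standing in for novel-HS (everything here is illustrative: a minimal DBSCAN re-implementation and a made-up fitness that penalizes the distance of the cluster count from K):

```python
import numpy as np

def dbscan_labels(points, eps, min_pts):
    # Minimal DBSCAN used only as the fitness target of the search below
    # (a plain re-implementation, not the paper's code). Noise stays -1.
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    core = (d <= eps).sum(axis=1) >= min_pts
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:                      # expand the cluster from core points
            j = stack.pop()
            for m in np.flatnonzero(d[j] <= eps):
                if labels[m] == -1:
                    labels[m] = cluster
                    if core[m]:
                        stack.append(m)
        cluster += 1
    return labels

def search_eps_for_k(points, k, min_pts, eps_grid):
    # Stand-in for the harmony-search step: score each candidate eps by
    # how far the resulting cluster count lands from the target K.
    return min(eps_grid,
               key=lambda e: abs(dbscan_labels(points, e, min_pts).max() + 1 - k))

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(3, 0.1, (30, 2))])
best_eps = search_eps_for_k(data, k=2, min_pts=4, eps_grid=[0.3, 5.0])
```

Harmony search replaces this exhaustive grid with a stochastic, memory-guided exploration of the (eps, MinPts) space, but the fitness principle is the same.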
Signal recovery technology based on an improved density-based spatial clustering of applications with noise algorithm is proposed for coherent optical communication systems. For an 80 Gb/s single-carrier polarization division multiplexing 16-quadrature amplitude modulation transmission system at 975 km, the Q-factor was up to 0.33 dB higher than that of traditional algorithms based on the minimum Euclidean distance. The experimental results show that this signal recovery technology can effectively compensate for nonlinear damage in coherent optical communication systems without requiring prior knowledge. It improves the transmission tolerance of optical fibre transmission systems and has good generalization ability.
dbscan : Fast Density-Based Clustering with R Hahsler, Michael; Piekenbrock, Matthew; Doran, Derek
Journal of Statistical Software, October 2019, Volume 91, Issue 1
Journal Article
Peer reviewed
Open access
This article describes the implementation and use of the R package dbscan, which provides complete and fast implementations of the popular density-based clustering algorithm DBSCAN and the augmented ordering algorithm OPTICS. Package dbscan uses advanced open-source spatial indexing data structures implemented in C++ to speed up computation. An important advantage of this implementation is that it is up to date with several improvements that have been added since the original algorithms were published (e.g., artifact corrections and dendrogram extraction methods for OPTICS). We provide a consistent presentation of the DBSCAN and OPTICS algorithms, and compare dbscan's implementation with other popular libraries such as the R package fpc, ELKI, WEKA, PyClustering, SciKit-Learn, and SPMF, in terms of available features and through an experimental comparison.