We analyze the drawbacks of DBSCAN and its variants, and find that the grid technique used in Fast-DBSCAN and ρ-approximate DBSCAN is almost useless in high dimensional data space, because it usually yields considerable redundant distance computations. To tame these problems, two techniques are proposed: one uses an ϵ/2-norm ball to identify Inner Core Blocks, within which all points are core points; it is more efficient than the grid technique because it finds more core points at one time. The other is a fast approximate algorithm for judging whether two Inner Core Blocks are density-reachable from each other. Besides, a cover tree is used to accelerate the density computations. Based on these three techniques, an approximate approach, namely BLOCK-DBSCAN, is proposed for large scale data; it runs in about O(n log(n)) expected time and obtains almost the same result as DBSCAN. BLOCK-DBSCAN has two versions: the L2 version works well for relatively high dimensional data, and the L∞ version is suitable for high dimensional data. Experimental results show that BLOCK-DBSCAN is promising and outperforms NQ-DBSCAN, ρ-approximate DBSCAN and AnyDBC.
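The Inner Core Block idea above can be sketched in a few lines: by the triangle inequality, if a ball of radius ϵ/2 around some point contains at least minPts points, then all of those points lie within ϵ of each other, so every one of them is a core point at once. A minimal Python sketch under that assumption (function and variable names are illustrative, not taken from the paper):

```python
from math import dist

def inner_core_blocks(points, eps, min_pts):
    """Bulk core-point detection: if a ball of radius eps/2 around a seed
    holds >= min_pts points, every point in the ball is a core point,
    because all pairwise distances inside the ball are <= eps."""
    n = len(points)
    is_core = [False] * n
    covered = [False] * n
    for i in range(n):
        if covered[i]:
            continue
        # O(n) range query around the seed; real implementations index this.
        ball = [j for j in range(n) if dist(points[i], points[j]) <= eps / 2]
        if len(ball) >= min_pts:
            for j in ball:
                is_core[j] = True
                covered[j] = True
    return is_core

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(inner_core_blocks(pts, eps=1.0, min_pts=3))
```

Note that points left uncovered by any dense ball may still be core points under the full DBSCAN definition; this bulk test only certifies cores cheaply, so a fallback per-point check is still required.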
•The underlying idea is that point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.
•NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, while ρ-Approximate DBSCAN is an approximate algorithm.
•The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n log(n)) provided the parameters are properly chosen, while ρ-Approximate DBSCAN runs in O(n²) in high dimensions.
•NQ-DBSCAN is suitable for clustering data with a lot of noise.
Clustering is an important technique for dealing with the large scale data being created explosively on the Internet. Most of these data are high-dimensional and noisy, which brings great challenges to retrieval, classification and understanding. No currently existing approach is "optimal" for large scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time only when the dimension D is a relatively small constant. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN, yielding NQ-DBSCAN, such that a large number of unnecessary distance computations is effectively avoided. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log(n)) time on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many real-time applications.
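For contrast with these optimized variants, the textbook DBSCAN loop with naive linear-scan range queries (hence O(n²) overall, the cost the papers above attack) looks roughly like this; it is a generic sketch, not the code of any cited paper:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        # O(n) range query -- the dominant cost, O(n^2) over the whole run.
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:          # not a core point (for now)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:               # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = [k for k in range(len(points))
                      if dist(points[j], points[k]) <= eps]
            if len(nbrs_j) >= min_pts:        # j is core: keep expanding
                seeds.extend(nbrs_j)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))
```

Every cluster expansion repeats the full-scan range query, which is exactly the work that grid, kNN, and block techniques try to avoid.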
Smart Card (SC) systems have been increasingly adopted by public transit (PT) agencies all over the world, facilitating not only fare collection but also PT service analyses and evaluations. Spatial clustering is one of the most important methods for investigating this big data in terms of activity locations, travel patterns, user behaviours, etc. Besides, spatio-temporal analysis of the clusters provides further precision in detecting PT travellers' activity locations and durations. This study focuses on investigating and comparing the effectiveness of two density-based clustering algorithms, DBSCAN and ST-DBSCAN. The numeric results are obtained using SC data (public bus system) from the metropolitan city of Konya, Turkey: the clustering algorithms are applied to a sample of this smart card data, and activity clusters are detected for the users. The results of the study suggest that ST-DBSCAN constitutes more compact clusters in both time and space for transportation researchers who want to accurately detect passengers' individual activity regions using SC data.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies high-density connected areas as clusters, which gives it an advantage in discovering arbitrary-shaped clusters. However, its parameters are difficult to tune, and since it must scan all data points in turn, its time complexity is O(n²). A granular-ball (GB) is a coarse-grained representation of data, based on the assumption that an object and its local neighbors have a similar distribution and are highly likely to belong to the same class. Granular-balls were introduced into supervised learning by Xia et al. to improve its efficiency. Inspired by this idea, we introduce granular-balls into unsupervised learning and use them to improve the efficiency of DBSCAN; the resulting algorithm is called GB-DBSCAN. The main idea of GB-DBSCAN is to employ granular-balls to represent sets of data points and then to cluster the granular-balls instead of the data points. First, we use k-nearest neighbors (KNN) to generate granular-balls in a bottom-up fashion, describing each ball by its center and radius. Then, the granular-balls are divided into Core-GBs and Non-Core-GBs according to their density. After that, the Core-GBs are merged into clusters following the idea of DBSCAN, and the Non-Core-GBs are assigned to the appropriate clusters. Since the number of granular-balls is much smaller than the number of objects in a dataset, the running time of DBSCAN is greatly reduced. Compared with the KNN-BLOCK DBSCAN, RNN-DBSCAN, DBSCAN, K-means, DP and SNN-DPC algorithms, the proposed algorithm obtains similar or even better clustering results in much less running time.
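The coarse-graining pipeline described above (build balls, cluster the ball centers, broadcast labels back to points) can be sketched as follows; the greedy radius-based ball construction and the single-link merge over centers are deliberate simplifications of the paper's KNN-based construction and core/non-core handling, for illustration only:

```python
from math import dist

def make_balls(points, r):
    """Greedy coarse-graining: each ball absorbs every remaining point
    within r of its seed; the ball is summarized by its mean center."""
    remaining = list(range(len(points)))
    balls = []  # list of (center, member_indices)
    while remaining:
        seed = remaining[0]
        members = [j for j in remaining if dist(points[seed], points[j]) <= r]
        remaining = [j for j in remaining if j not in members]
        cx = sum(points[m][0] for m in members) / len(members)
        cy = sum(points[m][1] for m in members) / len(members)
        balls.append(((cx, cy), members))
    return balls

def cluster_balls(balls, eps):
    """Merge balls whose centers lie within eps of each other (single link)."""
    n, labels, cluster = len(balls), [-1] * len(balls), 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack, labels[i] = [i], cluster
        while stack:
            a = stack.pop()
            for b in range(n):
                if labels[b] == -1 and dist(balls[a][0], balls[b][0]) <= eps:
                    labels[b] = cluster
                    stack.append(b)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
balls = make_balls(pts, r=1.5)
ball_labels = cluster_balls(balls, eps=3.0)
point_labels = [None] * len(pts)
for lbl, (_, members) in zip(ball_labels, balls):
    for m in members:
        point_labels[m] = lbl
print(point_labels)
```

The speedup comes from the merge step running over the (few) balls rather than the (many) points.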
In data mining and statistics, anomaly detection is the process of finding data patterns (outcomes, values, or observations) that deviate from the rest of the observations or outcomes. Anomaly detection is heavily used for solving real-world problems in many application domains, including medicine, finance, cybersecurity, banking, networking, transportation, and military surveillance of enemy activities, but it is not limited to these fields. In this paper, we present an empirical study of unsupervised anomaly detection techniques, namely Density-Based Spatial Clustering of Applications with Noise (DBSCAN), DBSCAN++ (with uniform initialization, k-center initialization, uniform initialization with approximate neighbors, and k-center initialization with approximate neighbors), and the k-means-- algorithm, on six benchmark imbalanced data sets. Findings from our in-depth empirical study show that k-means-- is more robust than DBSCAN and DBSCAN++ in terms of several evaluation measures (F1-score, false alarm rate, adjusted Rand index, and Jaccard coefficient) as well as running time. We also observe that DBSCAN performs very well on data sets with fewer data points. Moreover, the results indicate that the choice of clustering algorithm can significantly impact the performance of anomaly detection, and that the performance of different algorithms varies depending on the characteristics of the data. Overall, this study provides insights into the strengths and limitations of different clustering algorithms for anomaly detection and can help guide the selection of appropriate algorithms for specific applications.
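For readers unfamiliar with k-means--, its core loop alternates the usual Lloyd steps while setting aside the l points farthest from their nearest center as outlier candidates. The sketch below follows that general scheme under our own simplifying assumptions (naive first-k initialization, a fixed iteration count, and a toy dataset); it is not the implementation evaluated in the study:

```python
from math import dist

def kmeans_minus_minus(points, k, l, iters=20):
    """k-means with outliers: each round, the l points farthest from their
    nearest center are excluded before the centers are recomputed."""
    centers = list(points[:k])                 # naive init: first k points
    outliers = set()
    for _ in range(iters):
        # distance of every point to its nearest current center
        d = [min(dist(p, c) for c in centers) for p in points]
        order = sorted(range(len(points)), key=lambda i: d[i], reverse=True)
        outliers = set(order[:l])              # l farthest -> outlier candidates
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            if i in outliers:
                continue                       # outliers don't pull centers
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[j].append(p)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append(tuple(sum(v) / len(g) for v in zip(*g)))
            else:
                new_centers.append(centers[j])  # keep an emptied center
        centers = new_centers
    return centers, sorted(outliers)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
centers, outliers = kmeans_minus_minus(pts, k=2, l=1)
print(outliers)
```

Excluding the farthest points before the update step is what keeps a single extreme value from dragging a center away from its cluster.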
The fuzzy C-means (FCM) algorithm is one of the most famous fuzzy clustering algorithms, but it gets stuck in local optima and requires the number of clusters to be specified in advance. In contrast, the density-based spatial clustering of applications with noise (DBSCAN) algorithm, a density-based clustering method, does not require the number of clusters in advance and can determine it from the data. Another advantage of DBSCAN over FCM is its ability to cluster data of different shapes. In this paper, to overcome these limitations, a hybrid clustering approach is proposed that combines the FCM and DBSCAN algorithms. In this method, the optimal number of clusters and the optimal locations of the cluster centers are determined in three phases, based on changes that take place according to the data set, by anticipating the problems of the FCM algorithm stated above. With this improvement, none of the initial parameters of the FCM algorithm is random: in the first phase, these random values are replaced by near-optimal ones, which has a significant effect on the convergence of the algorithm because it reduces the number of iterations. The proposed method is evaluated on the Iris flower dataset and compared with the basic FCM algorithm and another algorithm. The results show the better performance of the proposed method.
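To make the algorithm being hybridized concrete, here is a one-iteration sketch of the standard FCM update equations (membership update followed by center update, with fuzzifier m = 2); the data, initial centers, and the zero-distance guard are our illustrative choices, not details from the paper:

```python
from math import dist

def fcm_step(points, centers, m=2.0):
    """One FCM iteration: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)),
    then each center becomes the u^m-weighted mean of all points."""
    c, dim = len(centers), len(points[0])
    U = []
    for p in points:
        d = [max(dist(p, v), 1e-12) for v in centers]   # guard zero distance
        U.append([1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(c))
                  for i in range(c)])
    new_centers = []
    for i in range(c):
        w = [U[j][i] ** m for j in range(len(points))]
        total = sum(w)
        new_centers.append(tuple(
            sum(wj * p[t] for wj, p in zip(w, points)) / total
            for t in range(dim)))
    return U, new_centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
U, centers = fcm_step(pts, [(0, 0.5), (10, 10.5)])
```

Because every point contributes to every center with a fractional membership, a bad initialization pulls all centers at once, which is why the proposed hybrid spends its first phase replacing the random initial values.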
Spatio-temporal data mining has become a prominent topic due to the high availability of spatio-temporal data from varied sources in diverse fields. Through many tracking devices, huge amounts of spatio-temporal data are being generated. In epidemiology, disease patterns and trends can be explored with methods such as spatio-temporal clustering to discover new knowledge. In this paper, Spatio-Temporal Density Based Spatial Clustering of Applications with Noise (ST-DBSCAN), an extension of the original Density Based Spatial Clustering of Applications with Noise (DBSCAN), is implemented and analysed on a public health dataset. The method is successfully applied to find spatio-temporal clusters in Chicago West Nile Virus (WNV) surveillance data for the period 2007 to 2017. Upon implementation, the results are analysed, drawbacks in the original ST-DBSCAN are identified, and a fuzzy version of ST-DBSCAN is proposed together with solutions to those drawbacks.
KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data. Chen, Yewang; Zhou, Lida; Pei, Songwen; et al. IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol. 51, Issue 6, June 2021. Journal article, peer reviewed.
Large-scale data clustering is an essential key to the big data problem. However, no currently existing approach is "optimal" for big data due to high complexity, which remains a great challenge. In this article, a simple but fast approximate DBSCAN, namely KNN-BLOCK DBSCAN, is proposed based on two findings: 1) the problem of identifying whether a point is a core point or not is, in fact, a kNN problem; and 2) a point has a density distribution similar to its neighbors', and neighboring points are highly likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely FLANN, to detect core-blocks (CBs), noncore-blocks, and noise-blocks within which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to proper clusters is also invented to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.
At SIGMOD 2015, an article was presented with the title “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” that won the conference’s best paper award. In this technical correspondence, ...we want to point out some inaccuracies in the way DBSCAN was represented, and why the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees and not at an algorithm that can use such indexes. We will also discuss the relationship of DBSCAN performance and the indexability of the dataset, and discuss some heuristics for choosing appropriate DBSCAN parameters. Some indicators of bad parameters will be proposed to help guide future users of this algorithm in choosing parameters such as to obtain both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen and thus they are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.
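One widely used parameter heuristic in this line of work is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbor (with k tied to minPts), sort the values, and look for a knee; values past the knee suggest noise, and an ε just below the jump tends to give meaningful clusters. A hedged sketch (the toy data and the jump threshold in the test are ours):

```python
from math import dist

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor;
    a sharp jump at the tail of this curve hints at noise points and
    suggests picking eps just below the jump."""
    out = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(ds[k - 1])
    return sorted(out)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (20, 20)]
kd = k_distances(pts, k=3)
```

On this toy set, four points share a small k-distance and the isolated point's value is an order of magnitude larger, which is the signature one looks for in the plot.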
•Propose the K-DBSCAN clustering method, which can obtain K clusters of arbitrary shapes.
•Present the novel harmony search to optimize the clustering parameters.
•Apply the novel harmony search to DBSCAN to realize K-DBSCAN clustering.
At present, the DBSCAN clustering algorithm is commonly used, principally due to its ability to discover clusters of arbitrary shapes. When the cluster number K is predefined, partitional clustering methods can perform efficiently, but they cannot handle non-convex clusters and easily fall into local optima. The concept of K-DBSCAN clustering is therefore proposed in this paper. However, basic DBSCAN has a crucial defect: it is difficult to predict suitable clustering parameters. Here, the well-known harmony search (HS) optimization algorithm is employed to deal with this problem. By modifying the original HS, a novel harmony search (novel-HS) is put forward, which improves the accuracy of the results as well as the robustness of the optimization. In K-DBSCAN, the novel-HS is used to optimize the clustering parameters of DBSCAN so as to obtain a better clustering with exactly K classes. Experimental results show that the designed clustering method has superior performance to others and can be successfully considered as a new clustering scheme for further research.