Density‐based clustering. Campello, Ricardo J. G. B.; Kröger, Peer; Sander, Jörg
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, March/April 2020, Volume 10, Issue 2
Journal Article
Peer-reviewed
Open access
Clustering refers to the task of identifying groups or clusters in a data set. In density‐based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects. Density‐based clusters are separated from each other by contiguous regions of low density of objects. Data objects located in low‐density regions are typically considered noise or outliers. In this review article we discuss the statistical notion of density‐based clusters, classic algorithms for deriving a flat partitioning of density‐based clusters, methods for hierarchical density‐based clustering, and methods for semi‐supervised clustering. We conclude with some open challenges related to density‐based clustering.
This article is categorized under:
Technologies > Data Preprocessing
Ensemble Methods > Structure Discovery
Algorithmic Development > Hierarchies and Trees
Figure: Correspondence of a density threshold in a density distribution and the defined density‐based clusters in a sample.
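The density-based notion summarized in the abstract can be illustrated with a minimal DBSCAN-style sketch, the classic flat-partitioning algorithm the review covers. The implementation below is an illustrative reimplementation, not the authors' code; `eps` and `min_pts` are the usual neighborhood radius and density threshold:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN sketch: grow clusters from core points whose
    eps-neighborhood is dense; leave sparse points labeled -1 (noise)."""
    n = len(X)
    labels = np.full(n, -1)                      # -1 = noise / unassigned
    # pairwise distances; eps-neighborhoods define local density
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # start a new cluster at an unassigned core point and expand it
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster              # density-reachable point
                if core[j]:
                    stack.extend(neighbors[j])   # keep expanding via cores
        cluster += 1
    return labels
```

Two tight groups of points end up in separate clusters, while an isolated point in a low-density region keeps the noise label, mirroring the abstract's description.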
GMC: Graph-Based Multi-View Clustering. Wang, Hao; Yang, Yan; Liu, Bing
IEEE Transactions on Knowledge and Data Engineering, June 2020, Volume 32, Issue 6
Journal Article
Peer-reviewed
Open access
Multi-view graph-based clustering aims to provide clustering solutions to multi-view data. However, most existing methods do not give sufficient consideration to weights of different views and require an additional clustering step to produce the final clusters. They also usually optimize their objectives based on fixed graph similarity matrices of all views. In this paper, we propose a general Graph-based Multi-view Clustering (GMC) framework to tackle these problems. GMC takes the data graph matrices of all views and fuses them to generate a unified graph matrix. The unified graph matrix in turn improves the data graph matrix of each view, and also gives the final clusters directly. The key novelty of GMC is its learning method, which can help the learning of each view graph matrix and the learning of the unified graph matrix in a mutual reinforcement manner. A novel multi-view fusion technique can automatically weight each data graph matrix to derive the unified graph matrix. A rank constraint without introducing a tuning parameter is also imposed on the graph Laplacian matrix of the unified matrix, which helps partition the data points naturally into the required number of clusters. An alternating iterative optimization algorithm is presented to optimize the objective function. Experimental results using both toy data and real-world data demonstrate that the proposed method outperforms state-of-the-art baselines markedly.
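As a rough illustration of the fusion idea, the sketch below averages per-view graph matrices into a unified graph and bisects it with the Laplacian's Fiedler vector. The uniform weighting and the two-way sign split are simplifications of GMC's learned view weights and rank constraint, not the paper's method:

```python
import numpy as np

def fuse_and_cut(view_graphs):
    """Fuse per-view similarity graphs (a plain mean stands in for GMC's
    learned weighting) and bisect the unified graph via its Laplacian."""
    S = np.mean(view_graphs, axis=0)
    S = (S + S.T) / 2                    # keep the fused graph symmetric
    L = np.diag(S.sum(axis=1)) - S       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]                 # 2nd-smallest eigenvector
    return (fiedler > 0).astype(int)     # sign split = two clusters
```

On two views of a four-node graph with a clear two-block structure, the fused graph's Fiedler vector changes sign exactly between the blocks, giving the partition without a separate clustering step.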
Power system capacity-expansion models are typically intractable if every operating period is represented. This issue is normally overcome by using a subset of representative operating periods. For instance, representative operating hours can be selected by discretizing the load-duration curve, which captures the effect of load levels on system-operation costs. This approach is inappropriate if system-operating costs depend on parameters other than load (e.g., renewable-resource availability) or if there are important intertemporal operating constraints (e.g., generator-ramping limits). This paper proposes the use of representative operating days, which are selected using clustering, to surmount these issues. We propose two hierarchical clustering techniques, which are designed to capture the important statistical features of the parameters (e.g., load and renewable-resource availability), in selecting representative days. This includes temporal autocorrelations and correlations between different locations. A case study, which is based on the Texan power system, is used to demonstrate the techniques. We show that our proposed clustering techniques result in investment decisions that closely match those made using the full unclustered dataset.
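The representative-day selection described above can be sketched with a small average-linkage agglomerative clusterer over daily parameter profiles, taking each cluster's medoid as its representative day. This is an illustrative sketch under simplified assumptions (Euclidean distance on raw profiles), not the paper's exact technique:

```python
import numpy as np

def representative_days(profiles, k):
    """Average-linkage agglomerative clustering of daily profiles;
    returns the clusters and, per cluster, the medoid day index."""
    clusters = [[i] for i in range(len(profiles))]

    def avg_link(a, b):
        # average pairwise distance between two clusters of day indices
        return np.mean([np.linalg.norm(profiles[i] - profiles[j])
                        for i in a for j in b])

    while len(clusters) > k:
        # merge the closest pair of clusters
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: avg_link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters.pop(b)

    reps = []
    for c in clusters:
        # medoid: the day minimizing total distance to its own cluster
        costs = [sum(np.linalg.norm(profiles[i] - profiles[j]) for j in c)
                 for i in c]
        reps.append(c[int(np.argmin(costs))])
    return clusters, reps
```

With, say, hourly load vectors as profiles, two well-separated operating regimes collapse into two clusters, and each cluster's medoid is the day fed to the capacity-expansion model.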
This article extends the expectation-maximization (EM) formulation for the Gaussian mixture model (GMM) with a novel weighted dissimilarity loss. This extension results in the fusion of two different clustering methods, namely, centroid-based clustering and graph clustering, in the same framework in order to leverage their advantages. The fusion of centroid-based clustering and graph clustering results in a simple "soft" asynchronous hybrid clustering method. The proposed algorithm may start as a pure centroid-based clustering algorithm (e.g., k-means), and as time evolves, it may eventually and gradually turn into a pure graph clustering algorithm, e.g., basic greedy asynchronous distributed interference avoidance (GADIA) (Babadi and Tarokh, 2010), as the algorithm converges, and vice versa. The "hard" version of the proposed hybrid algorithm includes the standard Hopfield neural networks (and, thus, Bruck's Ln algorithm (Bruck, 1990) and the Ising model in statistical mechanics), Babadi and Tarokh's basic GADIA (2010), and the standard k-means (Steinhaus, 1956; MacQueen, 1967), i.e., the Lloyd algorithm (Lloyd, 1957, 1982), as its special cases. We call the "hard" version of the proposed clustering "hybrid-nongreedy asynchronous clustering" (H-NAC). We apply H-NAC to various clustering problems using well-known benchmark datasets. The computer simulations confirm the superior performance of H-NAC compared to k-means clustering, k-GADIA, spectral clustering, and a very recent clustering algorithm, structured graph learning (SGL) by Kang et al. (2021), which represents one of the state-of-the-art clustering algorithms.
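For reference, the centroid-based endpoint of the proposed hybrid family is Lloyd's k-means, the "hard" special case named in the abstract. The sketch below is a generic Lloyd implementation (the naive first-k initialization is an assumption for brevity; real use would need k-means++ or restarts):

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50):
    """Lloyd's algorithm: the 'hard' centroid-based special case that the
    hybrid scheme reduces to when the graph term plays no role."""
    centers = X[:k].astype(float).copy()     # naive init: first k points
    for _ in range(iters):
        # hard E-step: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        assign = d.argmin(axis=1)
        # M-step: centroids become cluster means (empty clusters stay put)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):        # converged: assignments fixed
            break
        centers = new
    return assign, centers
```

Replacing the hard argmin with GMM responsibilities recovers soft EM; the article's hybrid additionally blends in a graph-based dissimilarity term, which this sketch omits.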
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SPECs, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning 10-million-level nonlinearly separable datasets on a PC with 64 GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.
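A toy version of the landmark-plus-bipartite-graph pipeline might look like the following. Evenly strided landmark sampling and a two-cluster sign split stand in for U-SPEC's hybrid representative selection and transfer cut; both are simplifications, not the paper's algorithm:

```python
import numpy as np

def uspec_sketch(X, n_reps=4, K=3):
    """Landmark-style spectral clustering in the spirit of U-SPEC:
    sparse affinity from objects to a few representatives, then a
    spectral cut of the resulting bipartite graph (2 clusters)."""
    # strided landmark sampling (illustrative stand-in for hybrid selection)
    reps = X[::max(1, len(X) // n_reps)][:n_reps]
    d = np.linalg.norm(X[:, None] - reps[None, :], axis=2)
    # keep only each object's K nearest representatives (sparse sub-matrix)
    idx = np.argsort(d, axis=1)[:, :K]
    rows = np.arange(len(X))[:, None]
    B = np.zeros_like(d)
    sigma = d.mean()                     # crude global bandwidth
    B[rows, idx] = np.exp(-d[rows, idx] ** 2 / (2 * sigma ** 2))
    # degree-normalize the bipartite graph; left singular vectors embed objects
    Bn = B / np.sqrt(B.sum(axis=1))[:, None]
    Bn = Bn / np.sqrt(np.maximum(B.sum(axis=0), 1e-12))[None, :]
    U, _, _ = np.linalg.svd(Bn, full_matrices=False)
    return (U[:, 1] > 0).astype(int)     # 2nd singular vector: two-way cut
```

Because the affinity matrix is only n-by-p (objects by representatives) with K nonzeros per row, the cost scales with the number of landmarks rather than with n squared, which is the source of the near-linear complexity the abstract claims.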