Data clustering is an important tool in data mining that helps retrieve useful information from large amounts of available data. In this digital era data is abundant, but finding useful data has become a challenging task. Data clustering is an effective and common approach for this: data are grouped so that items showing a common pattern or inherent similarity fall into the same group. Clustering is an unsupervised learning method for both linearly and nonlinearly separable clusters, widely used across applications of very different natures [1]. Data clustering finds application in the classification of patterns in areas such as artificial intelligence, summarization, learning, segmentation, speech recognition, pattern recognition, image segmentation, biology, marketing, data mining, modelling and system identification [5, 24, 25]. No single clustering technique can be called the best, because different clustering algorithms co-exist and are application specific. This paper mainly emphasises a critical review of clustering algorithms used in control systems, but a brief overview of all major algorithms is also given.
dbscan: Fast Density-Based Clustering with R. Hahsler, Michael; Piekenbrock, Matthew; Doran, Derek. Journal of Statistical Software, Volume 91, Issue 1, October 2019. Journal article, peer reviewed, open access.
This article describes the implementation and use of the R package dbscan, which provides complete and fast implementations of the popular density-based clustering algorithm DBSCAN and the augmented ordering algorithm OPTICS. Package dbscan uses advanced open-source spatial indexing data structures implemented in C++ to speed up computation. An important advantage of this implementation is that it is up-to-date with several improvements that have been added since the original algorithms were published (e.g., artifact corrections and dendrogram extraction methods for OPTICS). We provide a consistent presentation of the DBSCAN and OPTICS algorithms, and compare dbscan's implementation with other popular libraries such as the R package fpc, ELKI, WEKA, PyClustering, SciKit-Learn, and SPMF in terms of available features and through an experimental comparison.
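The package described above is written for R; as a rough, hedged illustration of the two algorithms it implements, the Python sketch below runs DBSCAN and OPTICS from scikit-learn (one of the libraries compared in the article) on a synthetic two-moons dataset. The parameter values are arbitrary and are not taken from the article.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, OPTICS

# Two interleaved half-moons: a classic case where density-based
# clustering separates non-convex clusters.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# DBSCAN: eps is the neighbourhood radius, min_samples the density threshold.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# OPTICS: builds a reachability ordering; labels are extracted here with the
# default xi-based method.
optics_labels = OPTICS(min_samples=5, max_eps=0.5).fit_predict(X)

print("DBSCAN clusters:", set(db_labels))    # -1 denotes noise points
print("OPTICS clusters:", set(optics_labels))
```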
Hierarchical clustering plays a crucial role in real-world knowledge discovery and data mining applications. This powerful technique provides tree-shaped results that are typically considered data summaries. However, achieving well-organized outputs requires a challenging trade-off between computational complexity (both in time and space) and clustering accuracy, especially in big data scenarios. To address this challenge, we propose a novel agglomerative algorithm for hierarchical clustering. Our algorithm constructs tree-shaped subclusters using a nearest-neighbour chain search. Next, the proxy (root) for each subcluster is identified using a local density peak detection mechanism, which guides the subsequent aggregation. Additionally, we propose a non-parametric variant to facilitate the easy implementation of the algorithm in real-world applications. Comprehensive experimental studies on fourteen real-world and synthetic datasets demonstrate that our algorithm surpasses other benchmarks in terms of clustering accuracy, response time, and memory footprint in most cases. Notably, our proposed algorithm can handle up to two million data points on a personal computer, further verifying its cost-effectiveness.
• A novel agglomerative clustering algorithm based on local density peaks is proposed.
• A non-parametric variant based on multi-scope cutoff distances is proposed.
• A probabilistic analysis is done to establish the theoretical correctness.
• Extensive experiments on real-world datasets verify the advantage of our approach.
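The paper's density-peak-guided agglomerative algorithm is not reproduced here; for context only, the sketch below shows a standard agglomerative baseline in Python, building the tree-shaped merge history with SciPy and cutting it into a fixed number of clusters. The synthetic blobs and the Ward linkage are illustrative choices, not the authors' setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three Gaussian blobs standing in for an arbitrary dataset.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (3, 0), (0, 3))])

Z = linkage(X, method="ward")                    # (n-1) x 4 merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(np.bincount(labels))
```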
We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete and average linkage schemes. It allows expert knowledge to be embodied in the cluster merge process and provides a much wider range of possible linkages. We analyse various families of weighting functions on numerous benchmark data sets in order to assess their influence on the resulting cluster structure. Moreover, we inspect the correction for the inequality of the cluster size distribution, similar to the one in the Genie algorithm. Our results demonstrate that by robustifying the procedure with the Genie correction, we can obtain a significant performance boost in terms of clustering quality. This is particularly beneficial in the case of the linkages based on the closest distances between clusters, including the single linkage and its “smoothed” counterparts. To explain this behaviour, we propose a new linkage process called three-stage OWA which yields further improvements. This way we confirm the intuition that hierarchical cluster analysis should take into account a few nearest neighbours of each point rather than trying to adapt to their non-local neighbourhood.
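A minimal sketch of the core idea of OWA-based linkage, assuming the merge distance between two clusters is the Ordered Weighted Average of their sorted pairwise point distances; putting all weight on the smallest or largest distance recovers single or complete linkage, and uniform weights give average linkage. The function name and weight vectors are illustrative, and the Genie-style inequality correction and the three-stage OWA variant are not shown.

```python
import numpy as np

def owa_linkage(A, B, weights):
    """OWA of the sorted pairwise distances between clusters A and B.

    weights must sum to 1 and have length len(A) * len(B).
    """
    d = np.sort(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel())
    return float(np.dot(weights, d))

A = np.array([[0.0, 0.0], [0.1, 0.0]])
B = np.array([[1.0, 0.0], [1.2, 0.1]])
m = len(A) * len(B)

single   = np.eye(m)[0]         # all weight on the closest pair  -> single linkage
complete = np.eye(m)[-1]        # all weight on the farthest pair -> complete linkage
average  = np.full(m, 1.0 / m)  # uniform weights                 -> average linkage

for name, w in [("single", single), ("complete", complete), ("average", average)]:
    print(name, owa_linkage(A, B, w))
```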
Many higher-education institutions have endeavored to understand students' characteristics in order to improve the quality of education. To this end, demographic information and questionnaire surveys have been used, and more recently, digital information from learning management systems and other sources has emerged for students' profiling. This study adopted a novel approach using semantic trajectory data created from smart card logs of campus buildings and class attendance records to investigate the relationship between students' trajectory patterns and academic performance. More than 4000 freshmen were observed per semester at the Songdo International Campus, Yonsei University, in South Korea during four semesters in 2016 and 2017. Dynamic time warping was adopted to calculate the similarities among student trajectories, and students' trajectories were then grouped by hierarchical clustering based on these similarities. Average grade point averages (GPAs) of the groups were evaluated and compared by major and gender. The results showed that the average GPAs were statistically different from each other in general, which confirmed the hypothesis that a student's trajectory differentiates a student's GPA. Furthermore, GPA was positively associated with students' degree of activeness in movement: the more accesses to campus facilities, the better the GPA. Besides, the differences in the average GPAs of the male groups were clearer than was the case for females, and the trajectory of the second semester better characterized an individual student. The study shows that a semantic trajectory pattern generated from location logs is a new and influential factor that can be utilized to understand students' characteristics in higher education and to predict their academic performances.
• The authors proposed students' semantic trajectory for student profiling.
• Large datasets were collected from over 4000 students for two school years.
• Dynamic time warping, hierarchical clustering, and ANOVA tests were conducted.
• Students' semantic trajectory was proved to be a new and influential factor associated with academic performance.
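A minimal sketch of the pipeline the abstract describes: dynamic time warping (DTW) distances between sequences, followed by hierarchical clustering of the resulting distance matrix. The textbook quadratic-time DTW below and the toy sequences are illustrative and are not the study's data or code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy "trajectories" (e.g., sequences of encoded campus locations per day).
seqs = [np.array(s, dtype=float) for s in
        ([1, 2, 3, 4], [1, 2, 2, 3, 4], [9, 8, 7], [9, 9, 8, 7])]

# Pairwise DTW distance matrix, condensed for SciPy's linkage.
n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(seqs[i], seqs[j])

Z = linkage(squareform(D), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```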
Power system capacity-expansion models are typically intractable if every operating period is represented. This issue is normally overcome by using a subset of representative operating periods. For instance, representative operating hours can be selected by discretizing the load-duration curve, which captures the effect of load levels on system-operation costs. This approach is inappropriate if system-operating costs depend on parameters other than load (e.g., renewable-resource availability) or if there are important intertemporal operating constraints (e.g., generator-ramping limits). This paper proposes the use of representative operating days, which are selected using clustering, to surmount these issues. We propose two hierarchical clustering techniques, which are designed to capture the important statistical features of the parameters (e.g., load and renewable-resource availability), in selecting representative days. This includes temporal autocorrelations and correlations between different locations. A case study, which is based on the Texan power system, is used to demonstrate the techniques. We show that our proposed clustering techniques result in investment decisions that closely match those made using the full unclustered dataset.
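As a rough illustration of the general idea (not the paper's specific techniques), the sketch below hierarchically clusters synthetic 24-hour load and renewable-availability profiles and picks the member nearest each cluster centroid as a representative day, weighted by cluster size. All data and parameter values are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# 365 synthetic daily profiles: 24 hourly load values + 24 hourly wind values.
days = np.hstack([
    50 + 10 * np.sin(np.linspace(0, 2 * np.pi, 24)) + rng.normal(0, 2, (365, 24)),
    rng.uniform(0, 1, (365, 24)),
])

Z = linkage(days, method="ward")
labels = fcluster(Z, t=8, criterion="maxclust")   # 8 representative days

for k in np.unique(labels):
    members = np.where(labels == k)[0]
    centroid = days[members].mean(axis=0)
    rep = members[np.argmin(np.linalg.norm(days[members] - centroid, axis=1))]
    weight = len(members) / len(days)   # share of the year this day represents
    print(f"cluster {k}: representative day {rep}, weight {weight:.2f}")
```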
Hierarchical clustering techniques help in building a tree-like structure called a dendrogram from the data points, which can be used to find the most closely related data objects. This paper presents a novel hierarchical clustering technique which considers intuitionistic fuzzy sets to deal with the uncertainty present in the data. Instead of using the traditional Hamming or Euclidean distance measures to find the distance between the data points, it employs a probabilistic Euclidean distance measure to propose a novel clustering approach which we term the 'Probabilistic Intuitionistic Fuzzy Hierarchical Clustering (PIFHC) Algorithm'. The proposed PIFHC algorithm considers probabilistic weights from the data to measure the distances between the data points. Clustering results over UCI datasets show that our proposed PIFHC algorithm gives better cluster accuracies than its existing counterparts. PIFHC provides improvements of 1%–3.5% in clustering accuracy compared to other fuzzy hierarchical clustering algorithms for most of the datasets. We further provide experimental results with the real-world car dataset and the Listeria monocytogenes dataset for mouse susceptibility to demonstrate the practical efficacy of the proposed algorithm. For the Listeria dataset as well, the proposed PIFHC records a 1.7% improvement over state-of-the-art methods. The dendrograms formed by the proposed PIFHC algorithm exhibit a high cophenetic correlation coefficient, with an improvement of 0.75% over others. We provide various AGNES methods to update the distance between merged clusters in the proposed PIFHC algorithm.
• This paper presents a novel hierarchical clustering approach based on intuitionistic fuzzy sets.
• The proposed approach is termed the 'probabilistic intuitionistic fuzzy hierarchical clustering (PIFHC)' algorithm.
• PIFHC employs a probabilistic Euclidean distance measure with different probabilistic weights for its different components.
• Also presents methods to compute the distances of the merged cluster from other clusters.
• Conducts extensive experiments over a number of benchmark and real-world datasets to demonstrate PIFHC's superiority over others.
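The exact probabilistic Euclidean distance of PIFHC should be taken from the paper; the sketch below only illustrates one plausible form of a weighted Euclidean distance over the membership, non-membership, and hesitation components of intuitionistic fuzzy sets. The uniform component weights, the function name, and the toy values are assumptions.

```python
import numpy as np

def ifs_weighted_euclidean(mu_a, nu_a, mu_b, nu_b, w=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted Euclidean distance between two intuitionistic fuzzy sets.

    Each set is given by per-element membership (mu) and non-membership (nu);
    hesitation is pi = 1 - mu - nu.  w holds the weights of the three
    components (uniform here; PIFHC derives its weights probabilistically
    from the data).
    """
    pi_a, pi_b = 1 - mu_a - nu_a, 1 - mu_b - nu_b
    sq = (w[0] * (mu_a - mu_b) ** 2
          + w[1] * (nu_a - nu_b) ** 2
          + w[2] * (pi_a - pi_b) ** 2)
    return float(np.sqrt(sq.mean()))

mu_a, nu_a = np.array([0.7, 0.2, 0.5]), np.array([0.2, 0.6, 0.3])
mu_b, nu_b = np.array([0.6, 0.3, 0.4]), np.array([0.3, 0.5, 0.4])
print(ifs_weighted_euclidean(mu_a, nu_a, mu_b, nu_b))
```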
• We introduce the MELD data model: a diffusion framework for multiscale clustering.
• We show how cluster coherence and separation interact with diffusion in MELD clusterings.
• We introduce the M-LUND multiscale clustering algorithm and guarantee its performance.
• We guarantee that M-LUND recovers the MELD data model from many datasets.
• We show M-LUND extracts latent multiscale structure in synthetic and real datasets.
Clustering algorithms partition a dataset into groups of similar points. The clustering problem is very general, and different partitions of the same dataset could be considered correct and useful. To fully understand such data, it must be considered at a variety of scales, ranging from coarse to fine. We introduce the Multiscale Environment for Learning by Diffusion (MELD) data model, which is a family of clusterings parameterized by nonlinear diffusion on the dataset. We show that the MELD data model precisely captures latent multiscale structure in data and facilitates its analysis. To efficiently learn the multiscale structure observed in many real datasets, we introduce the Multiscale Learning by Unsupervised Nonlinear Diffusion (M-LUND) clustering algorithm, which is derived from a diffusion process at a range of temporal scales. We provide theoretical guarantees for the algorithm's performance and establish its computational efficiency. Finally, we show that the M-LUND clustering algorithm detects the latent structure in a range of synthetic and real datasets.
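The sketch below is not M-LUND; it only illustrates the underlying idea of clustering with nonlinear diffusion at different time scales: a Gaussian-kernel affinity matrix is row-normalized to a Markov transition matrix, diffused for t steps, and the diffused profiles are clustered, with small t exposing fine structure and large t exposing coarse structure. The kernel bandwidth, time scales, and use of k-means are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two coarse groups, each containing two fine sub-groups.
centers = np.array([[0, 0], [1.5, 0], [8, 0], [9.5, 0]])
X = np.vstack([c + rng.normal(0, 0.3, (40, 2)) for c in centers])

# Gaussian-kernel affinities, row-normalized to a Markov transition matrix P.
W = np.exp(-cdist(X, X) ** 2 / (2 * 0.5 ** 2))
P = W / W.sum(axis=1, keepdims=True)

# Diffuse for t steps and cluster the diffused row profiles.
for t, k in [(1, 4), (50, 2)]:   # small t -> fine scale, large t -> coarse scale
    Pt = np.linalg.matrix_power(P, t)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Pt)
    print(f"t={t}: cluster sizes {np.bincount(labels)}")
```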
• A strategy was developed to identify building typical electricity usage profiles.
• This strategy consists of intra-building clustering and inter-building clustering.
• This strategy can discover electricity usage behaviors of multiple buildings.
• This strategy outperformed two single-step clustering strategies.
This paper presents a clustering-based strategy to identify typical daily electricity usage (TDEU) profiles of multiple buildings. Different from the majority of existing clustering strategies, the proposed strategy consists of two levels of clustering, i.e. intra-building clustering and inter-building clustering. The intra-building clustering used a Gaussian mixture model-based clustering to identify the TDEU profiles of each individual building. The inter-building clustering used an agglomerative hierarchical clustering to identify the TDEU profiles of multiple buildings based on the TDEU profiles identified for each individual building through intra-building clustering. The performance of this strategy was evaluated using two-year hourly electricity consumption data collected from 40 university buildings. The results showed that this strategy can discover useful information related to building electricity usage, including typical patterns of daily electricity usage (DEU) and periodical variation of DEU. It was also shown that the proposed strategy can identify additional electricity usage patterns at a lower computational cost, in comparison to two single-step clustering strategies: a Partitioning Around Medoids-based clustering strategy and a hierarchical clustering strategy. The results obtained from this study could potentially be used to assist in improving the energy performance of university buildings and other types of buildings.
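A minimal sketch of the two-level strategy described above, assuming synthetic hourly data: a Gaussian mixture model per building whose component means stand in for that building's TDEU profiles, followed by agglomerative clustering over all buildings' profiles. Component counts, cluster counts, and the synthetic data are assumptions, not the study's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Synthetic hourly profiles: per building, rows are days and columns are 24 hours.
buildings = [rng.normal(loc=base, scale=1.0, size=(365, 24))
             for base in (np.linspace(5, 20, 24),
                          np.linspace(10, 30, 24),
                          np.linspace(3, 8, 24))]

# Intra-building step: a Gaussian mixture per building; component means act
# as that building's typical daily electricity usage (TDEU) profiles.
tdeu_profiles = []
for X in buildings:
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    tdeu_profiles.append(gmm.means_)
tdeu_profiles = np.vstack(tdeu_profiles)

# Inter-building step: agglomerative clustering over all buildings' TDEU profiles.
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(tdeu_profiles)
print(labels.reshape(len(buildings), -1))   # cluster id of each building's profiles
```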
Density Peak clustering (DPC) is a novel algorithm that can quickly identify density peaks. But it comes with two drawbacks: its allocation strategy may produce non-adjacent associations that can lead to poor clustering results and even cause its cluster center selection method to malfunction and mistakenly identify cluster centers; and its high O(n²) complexity makes it perform poorly on large-scale data. Herein, a fast hierarchical clustering of local density peaks via an association degree transfer method (FHC-LDP) is proposed. To avoid DPC's drawbacks caused by non-adjacent associations, FHC-LDP only considers the association between neighbors and designs an association degree transfer method to evaluate the association between points that are not neighbors. FHC-LDP can quickly identify local density peaks as sub-cluster centers to generate sub-clusters automatically and evaluate the similarity between sub-clusters. Then, by analyzing the similarity of sub-cluster centers, a hierarchical structure of sub-clusters is built. FHC-LDP replaces DPC's cluster center selection method with a bottom-up hierarchical approach to ensure the sub-clusters in each cluster are most similar. In FHC-LDP, only neighbor information of the data is required, so by using a fast KNN algorithm, FHC-LDP runs in about O(n log n) time. Experimental results demonstrate that FHC-LDP is remarkably superior to traditional clustering algorithms and other variants of DPC in recognizing cluster structure and in running speed.
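The sketch below is not FHC-LDP (the association degree transfer and the bottom-up merging of sub-clusters are omitted); it only illustrates the kNN-based ingredients: a local density estimated from the k nearest neighbours and local density peaks defined as points with no denser neighbour, which would serve as sub-cluster centres. The density definition and the value of k are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (100, 2)) for c in ((0, 0), (4, 4))])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]               # drop the self-neighbour column

# kNN-based local density: inverse of the mean distance to the k neighbours.
density = 1.0 / dist.mean(axis=1)

# A point is a local density peak if no neighbour is denser than it;
# other points would attach to a denser neighbour, forming sub-clusters.
peaks = np.where(density >= density[idx].max(axis=1))[0]
print("number of local density peaks (sub-cluster centres):", len(peaks))
```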