•A strategy was developed to identify typical electricity usage profiles of buildings.•The strategy consists of intra-building clustering and inter-building clustering.•The strategy can discover electricity usage behaviors of multiple buildings.•The strategy outperformed two single-step clustering strategies.
This paper presents a clustering-based strategy to identify typical daily electricity usage (TDEU) profiles of multiple buildings. Unlike the majority of existing clustering strategies, the proposed strategy consists of two levels of clustering, i.e., intra-building clustering and inter-building clustering. The intra-building clustering used Gaussian mixture model-based clustering to identify the TDEU profiles of each individual building. The inter-building clustering used agglomerative hierarchical clustering to identify the TDEU profiles of multiple buildings, based on the profiles identified for each individual building through intra-building clustering. The performance of this strategy was evaluated using two years of hourly electricity consumption data collected from 40 university buildings. The results showed that the strategy can discover useful information related to building electricity usage, including typical patterns of daily electricity usage (DEU) and periodic variation of DEU. It was also shown that the proposed strategy can identify additional electricity usage patterns at a lower computational cost than two single-step clustering strategies, a Partitioning Around Medoids-based clustering strategy and a hierarchical clustering strategy. The results obtained from this study could potentially be used to help improve the energy performance of university buildings and other types of buildings.
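The two-level pipeline described above can be sketched roughly as follows with scikit-learn; the toy data, the number of mixture components, and the final cluster count are our own illustrative assumptions, not the paper's settings.

```python
# Sketch of intra-building clustering (GMM per building) followed by
# inter-building agglomerative clustering of the pooled typical profiles.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy data: 5 buildings, each with 100 days of 24 hourly readings.
buildings = [rng.random((100, 24)) for _ in range(5)]

# Intra-building: fit a GMM to each building's daily profiles and keep
# each component mean as one typical daily electricity usage profile.
typical_profiles = []
for daily_profiles in buildings:
    gmm = GaussianMixture(n_components=3, random_state=0).fit(daily_profiles)
    typical_profiles.extend(gmm.means_)

# Inter-building: agglomerative clustering of all buildings' typical profiles.
agg = AgglomerativeClustering(n_clusters=4)
labels = agg.fit_predict(np.array(typical_profiles))
print(labels.shape)  # one label per building-level typical profile
```

The key design point is that the second stage never touches the raw daily data, only the per-building typical profiles, which is what keeps the overall cost low.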
Density Peak Clustering (DPC) is a novel algorithm that can quickly identify density peaks, but it comes with two drawbacks: its allocation strategy may produce non-adjacent associations that can lead to poor clustering results and can even cause its cluster center selection method to misidentify cluster centers; and its high O(n^2) complexity makes it perform poorly on large-scale data. Herein, a fast hierarchical clustering of local density peaks via an association degree transfer method (FHC-LDP) is proposed. To avoid the drawbacks caused by non-adjacent associations, FHC-LDP only considers associations between neighbors and designs an association degree transfer method to evaluate the association between points that are not neighbors. FHC-LDP can quickly identify local density peaks as sub-cluster centers, generate sub-clusters automatically, and evaluate the similarity between sub-clusters. Then, by analyzing the similarity of sub-cluster centers, a hierarchical structure of sub-clusters is built. FHC-LDP replaces DPC's cluster center selection method with a bottom-up hierarchical approach to ensure the sub-clusters within each cluster are maximally similar. Since FHC-LDP requires only the neighbor information of the data, using a fast KNN algorithm lets it run in about O(n log n) time. Experimental results demonstrate that FHC-LDP is remarkably superior to traditional clustering algorithms and other DPC variants in both recognizing cluster structure and running speed.
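The notion of a "local density peak" that FHC-LDP builds on can be illustrated with a small sketch: estimate each point's density from its k-nearest-neighbor distances, then flag points denser than all of their neighbors. This is our own toy rendering of the building block, not the authors' implementation.

```python
# KNN-based local density peaks: a point is a peak if no neighbor is denser.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dist, idx = nn.kneighbors(X)
density = 1.0 / dist[:, 1:].mean(axis=1)  # closer neighbors -> higher density

# Local density peaks become sub-cluster centers in an FHC-LDP-style pipeline.
peaks = [i for i in range(len(X)) if density[i] >= density[idx[i, 1:]].max()]
print(len(peaks))  # a handful of candidate sub-cluster centers
```

Because only neighbor information is needed, the dominant cost is the KNN query, which is what allows the roughly O(n log n) behavior claimed above.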
Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than objects in different groups. Most clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation of its validity. The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, no programs are available to test and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices that determine the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, helping to determine the most appropriate number of clusters for the data set of interest.
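NbClust itself is an R package; the following is a rough Python analogue of its core idea, scoring several candidate cluster counts with a validity index and keeping the best one. Here a single index (the silhouette) stands in for NbClust's 30, and the data set is synthetic.

```python
# Pick the number of clusters by maximizing a validity index over candidate k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Toy data with four well-separated groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4 for this well-separated toy data
```

NbClust's added value over a sketch like this is that it aggregates many indices and reports the count most of them agree on, which is more robust than trusting any single index.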
In the structure of nature, we believe there is underlying knowledge in all the phenomena we wish to understand. In epidemiology especially, we often seek the structure of the data obtained, the pattern of a disease, and the nature or cause of its emergence among living organisms. Sometimes the outbreak of a disease is ambiguous and its exact cause is unknown. A significant number of algorithms and methods are available for clustering disease data, yet the literature shows no trace of accounting for indeterminacy or vagueness in data, an issue that deserves close attention in the epidemiological field. This study analyzes dengue attacks in 26 districts of Sri Lanka over the seven-year period from 2012 to 2018. Clusters of low-risk, medium-risk, and high-risk areas affected by dengue are identified. In this paper, we propose a new algorithm, called the Neutrosophic-Fuzzy Hierarchical Clustering algorithm (NFHC), that includes indeterminacy. The proposed algorithm is compared with a fuzzy hierarchical clustering algorithm and a hierarchical clustering algorithm. Finally, the results are evaluated with benchmarking indexes and the performance of the clustering algorithms is studied. NFHC performed considerably better than the other two algorithms. Keywords: Dengue; Hierarchical clustering; Fuzzy hierarchical clustering; Neutrosophic logic
Width deviation is an important metric for evaluating the quality of a hot-rolled strip in steel production systems. This paper considers a width deviation prediction problem and proposes a Machine-learning and Genetic-algorithm-based Hybrid method named MGH to obtain a prediction model. Existing work mainly focuses on high prediction accuracy while ignoring interpretability. This work aims to build a prediction model that makes a good trade-off between two industry-required criteria, i.e., prediction accuracy and interpretability. It first collects process variables from a hot rolling process and places them, together with some constructed variables, in a feature pool. MGH is then proposed to find representative variables in this pool and build a prediction model. MGH results from the integration of hierarchical clustering, a genetic algorithm, and generalized linear regression: hierarchical clustering divides the variables into clusters, while the genetic algorithm and generalized linear regression are innovatively combined to select a representative variable from each cluster and develop a prediction model. Computational experiments on both industrial and public datasets show that the proposed method effectively balances the prediction accuracy and interpretability of its resulting model, and that it has better overall performance than the compared state-of-the-art models.
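The variable-grouping step of such a pipeline can be sketched as follows: hierarchically cluster features by correlation distance, then pick one representative per cluster. For simplicity the representative here is the feature most correlated with its cluster's mean; the paper instead drives that selection with a genetic algorithm, and the data below is synthetic.

```python
# Group correlated features and keep one representative per group.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 3))
# 9 features: three noisy copies of each of three latent variables.
X = np.hstack([base + 0.1 * rng.normal(size=(200, 3)) for _ in range(3)])

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)          # correlation distance between features
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
groups = fcluster(Z, t=3, criterion="maxclust")

representatives = []
for g in np.unique(groups):
    members = np.where(groups == g)[0]
    center = X[:, members].mean(axis=1)
    corrs = [abs(np.corrcoef(X[:, m], center)[0, 1]) for m in members]
    representatives.append(members[int(np.argmax(corrs))])
print(sorted(representatives))  # one feature index per cluster
```

Fitting a generalized linear model on only the representatives is what keeps the final model small and interpretable while retaining most of the predictive signal.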
Many high-throughput biological data analyses require the calculation of large correlation matrices and/or the clustering of a large number of objects. The standard R function for calculating Pearson correlation handles calculations without missing values efficiently, but is inefficient when applied to data sets with even a relatively small number of missing entries. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve additional speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 (where n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust, which implements the original algorithm that in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
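The pairwise-complete handling of missing values that the text refers to can be illustrated in a few lines: each pair of columns is correlated over only the rows where both are observed. This is a plain Python sketch of the semantics, not the optimized WGCNA implementation.

```python
# Pairwise-complete Pearson correlation in the presence of scattered NaNs.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X[rng.integers(0, 100, 5), rng.integers(0, 4, 5)] = np.nan  # sprinkle NaNs

def pairwise_complete_corr(X):
    p = X.shape[1]
    C = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])  # rows observed in both
            C[i, j] = C[j, i] = np.corrcoef(X[ok, i], X[ok, j])[0, 1]
    return C

C = pairwise_complete_corr(X)
print(np.isfinite(C).all())  # True: NaNs handled pair by pair
```

The speedup WGCNA describes comes from avoiding this per-pair masking on the (large) complete portion of the data and correcting only for the few missing entries.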
Clustering is an essential tool in data mining research and applications. It is the subject of active research in many fields of study, such as computer science, data science, statistics, pattern recognition, artificial intelligence, and machine learning. Several clustering techniques have been proposed and implemented, and most of them successfully find excellent-quality or optimal clustering results in the domains mentioned above. However, there has been a gradual shift in the choice of clustering methods among domain experts and practitioners alike, precipitated by the fact that most traditional clustering algorithms still depend on a number of clusters provided a priori. These conventional clustering algorithms cannot effectively handle real-world clustering problems where the number of clusters in the data cannot easily be identified, nor problems where the optimal number of clusters for a high-dimensional dataset cannot easily be determined. Therefore, there is a need for improved, flexible, and efficient clustering techniques. Recently, a variety of efficient clustering algorithms have been proposed in the literature, and these algorithms have produced good results when evaluated on real-world clustering problems. This study presents an up-to-date, systematic, and comprehensive review of traditional and state-of-the-art clustering techniques for different domains. This survey considers clustering from a more practical perspective. It shows the outstanding role of clustering in various disciplines, such as education, marketing, medicine, biology, and bioinformatics. It also discusses the application of clustering to fields attracting intensive efforts from the scientific community, such as big data, artificial intelligence, and robotics. This survey paper will be beneficial for both practitioners and researchers.
It will serve as a good reference point for researchers and practitioners to design improved and efficient state-of-the-art clustering algorithms.
•Provide an up-to-date comprehensive review of the different clustering techniques.•Highlight novel and recent practical application areas of clustering.•Provide a convenient research path for new researchers.•Help experts develop new algorithms for emerging challenges in the research area.
Hierarchical agglomerative methods stand out as particularly effective and popular approaches for clustering data. Yet these methods have not been systematically compared regarding the important issue of false positives while searching for clusters. A model of clusters involving a higher-density nucleus surrounded by a transition region, followed by outliers, is adopted as a means to quantify the relevance of the obtained clusters and address the problem of false positives. Six traditional methodologies, namely the single, average, median, complete, centroid, and Ward's linkage criteria, are compared with respect to the adopted model. Unimodal and bimodal datasets obeying uniform, Gaussian, exponential, and power-law distributions are considered for this comparison. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods detected clusters not corresponding directly to the nucleus.
•Six classical agglomerative clustering methods are compared regarding false positives.•Single linkage led to fewer false positives in unimodal distributions.•Single linkage yielded clusters corresponding more closely to the nuclei.
Time series clustering is a potent data mining method that facilitates the analysis of extensive time series data without requiring any prior knowledge. It finds wide-ranging use across various sectors, including financial and medical data analysis and sensor data processing. Given the high dimensionality, non-linearity, and redundancy characteristic of time series, conventional clustering algorithms frequently fall short of satisfactory results when applied directly to this kind of data. There is therefore a critical need to select suitable feature extraction methods and dimension reduction techniques. This paper introduces a time series clustering algorithm that draws primarily on polynomial-fitting derivative features for feature extraction to achieve effective clustering results. Initially, Hodrick-Prescott (HP) filtering is applied to the raw time series data to eliminate noise and redundancy. Subsequently, polynomial curve fitting (PCF) is applied to derive a globally continuous function fitting the time series. Next, by obtaining multi-order derivative values from this function, the time series is transformed into a multi-order derivative feature sequence. Lastly, we design a polynomial function derivative features-based dynamic time warping (PFD_DTW) algorithm for determining the distance between two time series of equal or unequal granular length, and then apply a hierarchical clustering method anchored on the PFD_DTW distances to cluster the time series. The effectiveness of this method is corroborated by experimental results on several practical datasets.
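The final step, hierarchical clustering over a DTW distance matrix, can be sketched as below. For brevity this uses plain DTW on raw series; the paper's PFD_DTW instead operates on the polynomial-derivative feature sequences, and the example series are our own.

```python
# DTW distance matrix + hierarchical clustering of a few toy series.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(nm) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 50)
series = [np.sin(t), np.sin(t + 0.2), np.cos(t), np.cos(t + 0.2)]

n = len(series)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(series[i], series[j])

labels = fcluster(linkage(squareform(D), method="average"),
                  t=2, criterion="maxclust")
print(labels)  # the two sine curves and the two cosine curves pair up
```

Because DTW aligns series elastically, the small phase shifts within each pair barely affect the distance, while the sine/cosine difference dominates the dendrogram.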
•A novel method simultaneously optimizes the district heating system’s type, scope, and equipment.•The method uses genetic and clustering algorithms to explore the spatial scale’s impact on district heating systems.•Results offer valuable insights linking algorithms with practical district heating solutions.
The design of district heating systems often involves consideration of the appropriate system scope, referred to as the spatial scale. To achieve climate-neutral heating in neighborhoods, spatial scale must be addressed in the early planning stages. This study presents a novel optimization method aimed at sustainable dimensioning of district heating systems by simultaneously optimizing the system type, scope, and equipment. The method quantifies the spatial scale of the targeted district heating system and assists in the decision between decentralized and centralized systems by using clustering algorithms and a genetic algorithm. The proposed approach was rigorously tested in a city-scale case study against centralized pipe network and standalone approaches. Our findings reveal that the proposed methodology outperforms traditional threshold methods and fixed-scope optimization in a three-group comparative study, culminating in a globally optimal solution for district heating systems. Finally, the study offers insights into the intricate interplay between energy systems and district heating network scopes, underscoring the pivotal role of spatial scale considerations in designing district heating systems toward climate-neutral paradigms.
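The clustering step of such a method can be illustrated with a toy sketch: group heat-demand points spatially so that each cluster can be assessed as a candidate district heating scope. The coordinates, demands, and the demand threshold below are invented for illustration, and the study's genetic-algorithm optimization of system type and equipment is omitted entirely.

```python
# Spatial clustering of heat-demand points into candidate network scopes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Buildings scattered around three neighborhood centers; demand in MWh/year.
centers = np.array([[0, 0], [3, 0], [0, 3]])
coords = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centers])
demand = rng.uniform(50, 500, len(coords))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
for k in range(3):
    total = demand[labels == k].sum()
    # Hypothetical rule of thumb: enough aggregated demand in a compact area
    # may justify a centralized network; otherwise standalone supply.
    verdict = "district network" if total > 5000 else "standalone"
    print(k, round(total), "MWh ->", verdict)
```

In the actual method, each such candidate scope would then be handed to the genetic algorithm, which optimizes equipment and decides centralized versus decentralized supply per cluster.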