Existing subspace clustering methods typically employ shallow models to estimate the underlying subspaces of unlabeled data points and cluster them into corresponding groups. However, due to the limited representative capacity of these shallow models, such methods may fail on realistic data that lack a linear subspace structure. To address this issue, we propose a novel subspace clustering approach built on a new deep model, the Structured AutoEncoder (StructAE). The StructAE learns a set of explicit transformations that progressively map input data points into nonlinear latent spaces while preserving both local and global subspace structure. In particular, to preserve local structure, the StructAE learns a representation for each data point by minimizing the reconstruction error with respect to the point itself. To preserve global structure, the StructAE incorporates prior structural information by encouraging the learned representations to preserve specified reconstruction patterns over the entire data set. To the best of our knowledge, StructAE is one of the first deep subspace clustering approaches. Extensive experiments show that the proposed StructAE significantly outperforms 15 state-of-the-art subspace clustering approaches in terms of five evaluation metrics.
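The two objectives described in this abstract can be condensed into a single loss. The sketch below is a hedged, minimal reading of the abstract, not the paper's actual formulation: `structae_loss`, the weight `lam`, and the prior self-expression matrix `C` (rows of `Z` as samples, `Z ≈ C Z`) are all illustrative assumptions.

```python
import numpy as np

def structae_loss(X, X_hat, Z, C, lam=0.1):
    """Illustrative combination of the two StructAE objectives (assumed form).

    X     : (n, d)  input points (rows are samples)
    X_hat : (n, d)  decoder reconstructions
    Z     : (n, p)  latent codes
    C     : (n, n)  prior self-expression pattern computed on the raw data
    """
    local = ((X - X_hat) ** 2).sum()   # local: each point reconstructs itself
    glob = ((Z - C @ Z) ** 2).sum()    # global: codes follow the prior pattern C
    return local + lam * glob
```

With `C` set to the identity the global term vanishes and the loss reduces to the plain autoencoder reconstruction error.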
This paper proposes a novel clustering-based evolutionary algorithm for many-objective optimization problems. Its main idea is to partition the population into a number of clusters, which is expected to ease the difficulty of balancing convergence and diversity in a high-dimensional objective space. Individuals with similar vector angles are gathered into the same cluster, so that the population's distribution can be well portrayed by the clusters. To find these clusters efficiently, partitional clustering is first used to divide the union population into m main clusters based on the m axis vectors (m is the number of objectives), and hierarchical clustering is then run on these m main clusters to obtain N final clusters (N is the population size and N > m). Finally, in environmental selection, one individual from each of the N clusters closest to the axis vectors is selected to maintain diversity, while one individual from each of the other clusters is chosen by a simple convergence indicator to ensure convergence. On well-known test problems with 5-15 objectives, extensive experiments validate the superiority of our algorithm over six competitive many-objective EAs, especially on problems with incomplete and irregular Pareto-optimal fronts.
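The two-stage grouping step can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the paper's exact procedure: solutions are assigned to the objective axis with the smallest vector angle (stage 1), then agglomeratively merged by angle similarity within each main cluster until N clusters remain (stage 2); `two_stage_clusters` is a hypothetical name and the actual linkage criterion may differ.

```python
import numpy as np

def two_stage_clusters(F, N):
    """F: (P, m) nonnegative objective matrix of the union population.
    Returns N clusters as lists of row indices."""
    P, m = F.shape
    unit = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    # Stage 1 (partitional): smallest angle to axis e_k == largest k-th
    # coordinate of the unit objective vector (for nonnegative objectives).
    main = np.argmax(unit, axis=1)
    # Stage 2 (hierarchical): start from singletons and merge the pair of
    # clusters, restricted to the same main cluster, whose mean unit
    # vectors have the smallest angle (largest cosine).
    clusters = [[i] for i in range(P)]
    while len(clusters) > N:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if main[clusters[a][0]] != main[clusters[b][0]]:
                    continue
                ca = unit[clusters[a]].mean(axis=0)
                cb = unit[clusters[b]].mean(axis=0)
                cos = ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb) + 1e-12)
                if cos > best:
                    best, pair = cos, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters
```

Because merging is confined to main clusters, the loop can always reduce the count down to the number of non-empty main clusters, which is at most m < N.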
Software module clustering is an unsupervised learning method used to group software entities (e.g., classes, modules, or files) with similar features. The obtained clusters may be used to study, analyze, and understand the structure and behavior of the software entities. Implementing software module clustering with optimal results is challenging, and researchers have accordingly addressed many aspects of software module clustering over the last decade, making it essential to consolidate the research evidence published in this area. In this study, 143 research papers on software module clustering from well-known literature databases were extensively reviewed to extract useful data. The obtained data were then used to answer several research questions regarding state-of-the-art clustering approaches, applications of clustering in software engineering, the clustering process, clustering algorithms, and evaluation methods. Several research gaps and challenges in software module clustering are discussed to provide a useful reference for researchers in this field.
Active Clustering with Model-Based Uncertainty Reduction
Caiming Xiong; Johnson, David M.; Corso, Jason J.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 1, January 2017
Journal Article
Peer reviewed
Open access
Semi-supervised clustering seeks to augment traditional clustering methods by incorporating side information provided via human expertise, in order to increase the semantic meaningfulness of the resulting clusters. However, most current methods are passive in the sense that the side information is provided beforehand and selected randomly. This may require a large number of constraints, some of which could be redundant, unnecessary, or even detrimental to the clustering results. Thus, in order to scale such semi-supervised algorithms to larger problems, it is desirable to pursue an active clustering method, i.e., an algorithm that maximizes the effectiveness of the available human labor by only requesting human input where it will have the greatest impact. Here, we propose a novel online framework for active semi-supervised spectral clustering that selects pairwise constraints as clustering proceeds, based on the principle of uncertainty reduction. Using a first-order Taylor expansion, we decompose the expected uncertainty reduction problem into a gradient and a step-scale, computed via an application of matrix perturbation theory and cluster-assignment entropy, respectively. The resulting model is used to estimate the uncertainty-reduction potential of each sample in the dataset. We then present the human user with pairwise queries with respect to only the best candidate sample. We evaluate our method on three image datasets (faces, leaves, and dogs), a set of common UCI machine learning datasets, and a gene dataset. The results validate our decomposition formulation and show that our method is consistently superior to existing state-of-the-art techniques, as well as robust to noise and to unknown numbers of clusters.
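As a toy illustration of the entropy half of this decomposition only (the perturbation-based gradient term is omitted entirely), one could rank samples by cluster-assignment entropy and query the most uncertain one; `pick_query_sample` is an illustrative name, not the paper's API.

```python
import numpy as np

def pick_query_sample(soft_assign):
    """soft_assign: (n, k) soft cluster-assignment probabilities per sample.
    Returns the index of the sample whose assignment is most uncertain."""
    p = np.clip(soft_assign, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)  # cluster-assignment entropy
    return int(entropy.argmax())            # best candidate for pairwise queries
```

A sample split 50/50 between two clusters has maximal entropy and would be queried first.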
The similarity among samples and the discrepancy among clusters are two crucial aspects of image clustering. However, current deep clustering methods suffer from inaccurate estimation of either feature similarity or semantic discrepancy. In this paper, we present a Semantic Pseudo-labeling-based Image ClustEring (SPICE) framework, which divides the clustering network into a feature model for measuring instance-level similarity and a clustering head for identifying cluster-level discrepancy. We design two semantics-aware pseudo-labeling algorithms, prototype pseudo-labeling and reliable pseudo-labeling, which enable accurate and reliable self-supervision over clustering. Without using any ground-truth label, we optimize the clustering network in three stages: 1) train the feature model through contrastive learning to measure instance similarity; 2) train the clustering head with the prototype pseudo-labeling algorithm to identify cluster semantics; and 3) jointly train the feature model and clustering head with the reliable pseudo-labeling algorithm to improve the clustering performance. Extensive experimental results demonstrate that SPICE achieves significant improvements (~10%) over existing methods and establishes new state-of-the-art clustering results on six balanced benchmark datasets in terms of three popular metrics. Importantly, SPICE significantly reduces the gap between unsupervised and fully supervised classification; e.g., there is only a 2% (91.8% vs. 93.8%) accuracy difference on CIFAR-10. Our code is publicly available at https://github.com/niuchuangnn/SPICE.
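A hedged sketch of what a prototype pseudo-labeling step might look like, under my own simplifying assumptions (each cluster's prototype is the mean feature of its top-k most confident samples, and nearest-prototype assignment yields the pseudo-labels; the paper's actual algorithm may differ):

```python
import numpy as np

def prototype_pseudo_labels(feats, probs, top_k=5):
    """feats: (n, d) instance features; probs: (n, k) clustering-head outputs.
    Returns a pseudo-label per sample via nearest cluster prototype."""
    n_clusters = probs.shape[1]
    # Prototype of cluster c: mean feature of its top_k most confident samples.
    protos = np.stack([
        feats[np.argsort(-probs[:, c])[:top_k]].mean(axis=0)
        for c in range(n_clusters)
    ])
    # Pseudo-label every sample by its nearest prototype.
    d = ((feats[:, None, :] - protos[None]) ** 2).sum(-1)
    return d.argmin(axis=1)
```

The reliable pseudo-labeling stage would additionally filter these labels (e.g., by neighborhood agreement) before joint training.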
Under the framework of spectral clustering, the key to subspace clustering is building a similarity graph that describes the neighborhood relations among data points. Some recent works build the graph using sparse, low-rank, and ℓ2-norm-based representations, and have achieved state-of-the-art performance. However, these methods suffer from two limitations. First, their time complexities are at least proportional to the cube of the data size, which makes them inefficient for large-scale problems. Second, they cannot cope with out-of-sample data that were not used to construct the similarity graph: to cluster each out-of-sample datum, they must recalculate the similarity graph and the cluster memberships of the whole data set. In this paper, we propose a unified framework that makes representation-based subspace clustering algorithms feasible for both out-of-sample and large-scale data. Under our framework, the large-scale problem is tackled by converting it into an out-of-sample problem through sampling, clustering, coding, and classifying. Furthermore, we give an estimation of the error bounds by treating each subspace as a point in a hyperspace. Extensive experimental results on various benchmark data sets show that our methods outperform several recently proposed scalable methods in clustering large-scale data sets.
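The sample-cluster-code-classify pipeline can be illustrated roughly as follows. This is a sketch under stated assumptions, not the paper's method: plain k-means stands in for the representation-based subspace clustering on the sampled subset, and minimum-norm least-squares coding stands in for the paper's coding scheme; `scalable_cluster` is a hypothetical name.

```python
import numpy as np

def scalable_cluster(X, k, n_landmarks, seed=0, iters=20):
    """X: (n, d) data. Returns a cluster label for every point, including
    points never seen by the clustering step (out-of-sample)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    L = X[idx]                                          # 1) sampling
    C = L[rng.choice(n_landmarks, size=k, replace=False)].copy()
    for _ in range(iters):                              # 2) clustering the subset
        lab = ((L[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = L[lab == j].mean(axis=0)
    lab = ((L[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
    # 3) coding: least-squares weights of every point over the landmarks
    W, *_ = np.linalg.lstsq(L.T, X.T, rcond=None)       # shape (n_landmarks, n)
    # 4) classifying: each point inherits the cluster whose landmarks
    #    carry the largest total absolute coding weight
    scores = np.stack([np.abs(W[lab == j]).sum(axis=0) for j in range(k)], axis=1)
    return scores.argmax(axis=1)
```

Only the landmark subset is ever clustered, so the expensive step no longer scales with the full data size; every other point is handled through the cheap coding-and-classifying stage.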
Multi-view clustering has attracted increasing attention in the multimedia, machine learning, and data mining communities. As one kind of essential multi-view clustering algorithm, multi-view subspace clustering (MVSC) has become more and more popular due to its strong ability to reveal the intrinsic low-dimensional clustering structure hidden across views. Despite superior clustering performance in various applications, we observe that existing MVSC methods directly fuse multi-view information at the similarity level by merging noisy affinity matrices, and isolate the processes of affinity learning, multi-view information fusion, and clustering. Both factors may cause insufficient utilization of multi-view information, leading to unsatisfactory clustering performance. This paper proposes a novel consensus one-step multi-view subspace clustering (COMVSC) method to address these issues. Instead of directly fusing multiple affinity matrices, COMVSC optimally integrates discriminative partition-level information, which helps eliminate noise in the data. Moreover, the affinity matrices, the consensus representation, and the final clustering label matrix are learned simultaneously in a unified framework. By doing so, the three steps can negotiate with each other to best serve the clustering task, leading to improved performance. Accordingly, we propose an iterative algorithm to solve the resulting optimization problem. Extensive experimental results on benchmark datasets demonstrate the superiority of our method over other state-of-the-art approaches.
Due to their efficiency in learning relationships and complex structures hidden in data, graph-oriented methods have been widely investigated and achieve promising performance. Generally, in the field of multi-view learning, these algorithms construct an informative graph for each view, on which the subsequent clustering or classification procedure is based. However, in many real-world data sets, the original data contain noise and outlying entries that result in unreliable and inaccurate graphs, which previous methods cannot ameliorate. In this paper, we propose a novel multi-view learning model that performs clustering/semi-supervised classification and local structure learning simultaneously. The obtained optimal graph can be partitioned into specific clusters directly. Moreover, our model can allocate an ideal weight to each view automatically, without explicit weight definitions or penalty parameters. An efficient algorithm is proposed to optimize this model. Extensive experimental results on different real-world data sets show that the proposed model outperforms other state-of-the-art multi-view algorithms.
In multi-core in-order processing systems, only one core can be utilised when the instruction at the head of the instruction queue produces data input for the next instruction in the queue. Although in-order processing has been studied in the past, the influence of data clustering, i.e., the extent to which subsequent instructions rely on each other's data, has been largely overlooked. Therefore, a queueing model is developed and closed-form formulae are provided for the stability condition and the average time before instructions are executed. These expressions clearly show that data clustering can have a devastating impact.
The fuzzy c-means clustering algorithm is the most widely used soft clustering algorithm. In contrast to hard clustering, the cluster memberships generated by the fuzzy c-means algorithm are ambiguous. As with hard clustering algorithms, the results of the fuzzy c-means algorithm are also suboptimal, with performance varying across initial solutions. In this paper, a collaborative annealing fuzzy c-means algorithm is presented. To address the issue of ambiguity, the proposed algorithm leverages an annealing procedure that phases out the fuzzy cluster membership degrees toward crisp ones by gradually reducing the exponent according to a cooling schedule. To address the issue of suboptimality, the proposed algorithm employs multiple fuzzy c-means modules to repeatedly generate alternative clusters based on memberships reinitialized using a metaheuristic rule. Experimental results on eight benchmark datasets demonstrate the superiority of the proposed algorithm over thirteen prevailing hard and soft clustering algorithms in terms of internal and external cluster validity indices.
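The annealing idea (cooling the fuzzifier exponent toward 1 so that memberships harden from fuzzy to near-crisp) can be sketched as below. The collaborative multi-module part and the metaheuristic reinitialization rule are omitted, and the function name, schedule constants, and update order are all assumptions, not the paper's specification.

```python
import numpy as np

def annealing_fcm(X, k, m0=2.0, m_min=1.05, cool=0.8, inner=15, seed=0):
    """Standard FCM updates while the fuzzifier m is cooled toward 1.
    X: (n, d) data. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    m = m0
    while True:
        for _ in range(inner):                          # FCM at fixed m
            d = ((X[:, None, :] - C[None]) ** 2).sum(-1) + 1e-12
            U = d ** (-1.0 / (m - 1.0))                 # u_ik ∝ d_ik^(-1/(m-1))
            U /= U.sum(axis=1, keepdims=True)
            Um = U ** m
            C = (Um.T @ X) / Um.sum(axis=0)[:, None]    # weighted centroids
        if m <= m_min:
            break
        m = max(m_min, 1.0 + (m - 1.0) * cool)          # cooling schedule
    d = ((X[:, None, :] - C[None]) ** 2).sum(-1) + 1e-12
    U = d ** (-1.0 / (m - 1.0))
    U /= U.sum(axis=1, keepdims=True)
    return C, U
```

As m approaches 1 the membership exponent blows up, so each point's membership concentrates almost entirely on its nearest center, resolving the ambiguity that a fixed fuzzifier leaves behind.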