•A refined k-nearest neighbor graph reduces the memory needed for spectral clustering.
•The method is deterministic and performs consistently across independent executions.
•The method scores higher than ASC methods in terms of clustering evaluation metrics.
Spectral clustering has become a popular choice for data clustering owing to its ability to uncover clusters of different shapes. However, it is not always preferable to other clustering methods because of its computational demands. One effective way to bypass these demands is to perform spectral clustering on a subset of points (data representatives) and then generalize the clustering outcome; this is known as approximate spectral clustering (ASC). ASC uses sampling or quantization to select data representatives, which makes it vulnerable to 1) performance inconsistency, since these methods involve a random step in either initialization or training, and 2) loss of local statistics, because the pairwise similarities are extracted from data representatives rather than from the data points themselves. We propose a refined version of the k-nearest neighbor graph in which all data points are kept and the number of edges is aggressively reduced for computational efficiency. Local statistics are exploited to keep the edges that do not violate the intra-cluster distances and to nullify all other edges in the k-nearest neighbor graph. We also introduce an optional step that automatically selects the number of clusters C. The proposed method was tested on synthetic and real datasets. Compared to ASC methods, it delivered consistent performance despite a significant reduction in edges.
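As a rough illustration of the idea, the sketch below builds a kNN graph and nullifies edges that are long relative to the local neighborhood scale of both endpoints. The specific pruning rule and the Gaussian edge weights are illustrative assumptions, not the paper's exact intra-cluster criterion.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def refined_knn_graph(X, k=10, alpha=1.0):
    """Build a kNN graph, then nullify edges that look like
    inter-cluster links according to a simple local statistic.
    The rule used here (keep an edge only if it is no longer than
    alpha times the mean neighbor distance of BOTH endpoints) is
    an illustrative stand-in for the paper's criterion."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)          # column 0 is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]
    mu = dist.mean(axis=1)                # local scale of each point
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j_pos, j in enumerate(idx[i]):
            d = dist[i, j_pos]
            # keep the edge only if it is consistent with the local
            # statistics of both endpoints; otherwise leave it at zero
            if d <= alpha * min(mu[i], mu[j]):
                W[i, j] = W[j, i] = np.exp(-d**2 / (mu[i] * mu[j] + 1e-12))
    return W
```

The resulting sparse affinity matrix W can then be fed to a standard spectral clustering routine in place of a fully connected similarity graph.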
Distance metric learning aims to handle the data distribution by learning a suitable distance metric from the training instances. For distance metric learning, the optimization constraints can be constructed from similar and dissimilar instance pairs, which are generated by selecting the nearest neighbors of each training instance. However, most methods select the same, fixed number of nearest neighbors for every training instance, which may limit the performance of the learned distance metric. In this paper, we propose a nearest-neighbor search model for distance metric learning (NNS-DML), which constructs the metric optimization constraints by searching for a different optimal nearest-neighbor number for each training instance. Specifically, we formulate a nearest-neighbor search matrix that contains the nearest-neighbor correlations of all training instances. Using this search matrix, we can construct and weight the metric optimization constraints of each training instance so that the influence of its irrelevant features on its similar and dissimilar instance pairs is reduced. Moreover, we develop a k-free nearest-neighbor model for classification problems via the SVM solver, which avoids having to set k. Extensive experiments show that the proposed NNS-DML method outperforms state-of-the-art distance metric learning methods.
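The search matrix in NNS-DML is learned jointly with the metric itself. As a loose stand-in only, the sketch below picks a per-instance neighbor count with a simple label-consistency heuristic, to illustrate why a fixed k can be suboptimal; this greedy rule is not the paper's optimization.

```python
import numpy as np

def per_instance_k(X, y, k_max=15):
    """For each training instance, pick the largest neighbor count
    whose neighborhood stays dominated by the instance's own label.
    A crude substitute for NNS-DML's learned search matrix, which is
    optimized jointly with the metric rather than set by a rule."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # never pick oneself
    ks = np.empty(n, dtype=int)
    for i in range(n):
        order = np.argsort(D[i])[:k_max]   # candidate neighbors, nearest first
        same = (y[order] == y[i])
        k = 1
        for m in range(1, k_max + 1):
            # grow the neighborhood while the majority shares y[i]
            if same[:m].mean() >= 0.5:
                k = m
        ks[i] = k
    return ks
```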
The k-nearest neighbor (KNN) rule is a very simple and powerful classification algorithm. In this article, we propose a new KNN-based classifier, called the local mean-based pseudo nearest neighbor (LMPNN) rule. It is motivated by the local mean-based k-nearest neighbor (LMKNN) rule and the pseudo nearest neighbor (PNN) rule, with the aim of improving classification performance. In the proposed LMPNN, the k nearest neighbors from each class are found as the class prototypes, and the local mean vectors of these neighbors are then computed. Subsequently, we find the local mean-based pseudo nearest neighbor per class by employing the categorical k local mean vectors, and classify the unknown query pattern according to the distances between the query and the pseudo nearest neighbors. To assess the classification performance of the proposed LMPNN, it is compared with competing classifiers, such as LMKNN and PNN, in terms of the classification error on thirty-two real UCI datasets, four artificial datasets and three image datasets. The comprehensive experimental results suggest that the proposed LMPNN classifier is a promising algorithm for pattern recognition.
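A minimal sketch of the LMPNN decision rule as described above, assuming Euclidean distance and the 1/i weighting of the PNN rule:

```python
import numpy as np

def lmpnn_predict(X_train, y_train, query, k=5):
    """LMPNN rule: per class, take its k nearest neighbors to the
    query, form the k local mean vectors (means of the 1, 2, ..., k
    nearest), and score the class by a weighted sum of the query's
    distances to those means, with weight 1/i for the i-th mean."""
    best_label, best_score = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d = np.linalg.norm(Xc - query, axis=1)
        nn = Xc[np.argsort(d)[:k]]                 # k nearest of class c
        # local mean vectors: cumulative means of the sorted neighbors
        means = np.cumsum(nn, axis=0) / np.arange(1, len(nn) + 1)[:, None]
        score = sum(np.linalg.norm(query - m) / (i + 1)
                    for i, m in enumerate(means))
        if score < best_score:
            best_label, best_score = c, score
    return best_label
```

The class whose weighted distance to its pseudo nearest neighbor is smallest wins, so a single atypical neighbor is damped by the local averaging.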
The k-nearest neighbor (kNN) classifier is a classical classification algorithm that has been applied in many fields. However, the performance of the kNN classifier is limited by a simple neighbor selection method, called the nearest neighbor (NN) rule, in which only the neighborhood of the query is considered when selecting the query's nearest neighbors. In other words, the NN rule uses only one layer of neighborhood information around the query.
In this paper, we propose a new neighbor selection method based on two-layer neighborhood information, called the two-layer nearest neighbor (TLNN) rule. The neighborhood of the query and the neighborhoods of all training instances selected within it are considered simultaneously; the two-layer nearest neighbors of the query are then determined according to the distance, the distribution relationship, and the backward nearest-neighbor relationship between the query and all training instances in these neighborhoods. To verify the effectiveness of the proposed TLNN rule, a k-two-layer nearest neighbor (kTLNN) classifier is proposed to measure the classification ability of the two-layer nearest neighbors.
Extensive experiments on twenty real-world datasets from the UCI and KEEL repositories show that the kTLNN classifier outperforms not only the kNN classifier but also seven other state-of-the-art NN-based classifiers.
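A simplified sketch of the two-layer idea follows: it expands the candidate pool with neighbors-of-neighbors and then keeps the k pool members closest to the query. The paper's distribution and backward nearest-neighbor filtering criteria are omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_layer_neighbors(X_train, query, k=5):
    """Illustrative simplification of the TLNN rule: build a candidate
    pool from the query's k neighbors (layer 1) plus the k neighbors
    of each of those neighbors (layer 2), then return the k pool
    members closest to the query. The full rule additionally weighs
    distribution and backward nearest-neighbor relationships."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, first = nn.kneighbors(query[None, :])       # layer-1 neighbors
    pool = set(first[0])
    _, second = nn.kneighbors(X_train[first[0]])   # layer-2 neighbors
    pool.update(second.ravel())
    pool = np.fromiter(pool, dtype=int)
    d = np.linalg.norm(X_train[pool] - query, axis=1)
    return pool[np.argsort(d)[:k]]                 # final k two-layer NNs
```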
Wide-area monitoring of power systems is important for system security and stability. It involves the detection and localization of power system disturbances. However, the oscillatory trends and noise in electrical measurements often mask disturbances, making wide-area monitoring a challenging task. This paper presents a wide-area monitoring method that detects and locates power system disturbances by combining a multivariate analysis, Principal Component Analysis (PCA), with a time series analysis, k-Nearest Neighbor (kNN) analysis. The advantages of this method are that it can analyze a large number of wide-area variables in real time and can reduce the masking effect of oscillatory trends and noise on disturbances. Case studies conducted on data from a four-variable numerical model and the New England power system model demonstrate the effectiveness of this method.
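A generic sketch of such a PCA-plus-kNN monitor is given below. The component count, neighbor count, and percentile control limit are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def fit_monitor(X_normal, n_components=3, k=5):
    """Fit the monitor on normal-operation measurements: PCA compresses
    the many wide-area variables and suppresses noise, and the kNN
    distance in the PCA score space measures how far a sample lies
    from normal behavior. The 99th-percentile limit is an assumption."""
    pca = PCA(n_components=n_components).fit(X_normal)
    scores = pca.transform(X_normal)
    knn = NearestNeighbors(n_neighbors=k + 1).fit(scores)
    d, _ = knn.kneighbors(scores)          # column 0 is the point itself
    limit = np.percentile(d[:, 1:].mean(axis=1), 99)
    return pca, knn, limit

def is_disturbance(sample, pca, knn, limit, k=5):
    """Flag a new measurement vector whose average kNN distance in the
    score space exceeds the normal-operation control limit."""
    z = pca.transform(sample[None, :])
    d, _ = knn.kneighbors(z, n_neighbors=k)
    return float(d.mean()) > limit
```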
The Inverted Multi-Index. Babenko, Artem; Lempitsky, Victor. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 6, 1 June 2015. Journal article, peer-reviewed.
A new data structure for efficient similarity search in very large datasets of high-dimensional vectors is introduced. This structure, called the inverted multi-index, generalizes the inverted index idea by replacing the standard quantization within inverted indices with product quantization. For very similar retrieval complexity and pre-processing time, inverted multi-indices achieve a much denser subdivision of the search space than inverted indices, while retaining their memory efficiency. Our experiments with large datasets of SIFT and GIST vectors demonstrate that, because of the denser subdivision, inverted multi-indices are able to return much shorter candidate lists with higher recall. Augmented with a suitable reranking procedure, multi-indices were able to significantly improve the speed of approximate nearest neighbor search on a dataset of 1 billion SIFT vectors compared to the best previously published systems, while achieving better recall and incurring only a few percent of memory overhead.
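A minimal sketch of the indexing side of this idea, assuming two sub-quantizers learned with k-means; the paper's multi-sequence traversal of candidate cells at query time is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_multi_index(X, K=256):
    """Split each vector into two halves, learn a separate K-word
    codebook per half, and index each vector by its pair of nearest
    codewords. This yields K*K cells (65,536 for K=256) for the
    training cost of only two K-word quantizers."""
    d = X.shape[1] // 2
    cb1 = KMeans(n_clusters=K, n_init=4).fit(X[:, :d])
    cb2 = KMeans(n_clusters=K, n_init=4).fit(X[:, d:])
    c1 = cb1.predict(X[:, :d])     # nearest codeword, first half
    c2 = cb2.predict(X[:, d:])     # nearest codeword, second half
    cells = {}                     # (codeword, codeword) -> inverted list
    for i, key in enumerate(zip(c1, c2)):
        cells.setdefault(key, []).append(i)
    return cb1, cb2, cells
```

The dense K×K subdivision is what allows the multi-index to return much shorter candidate lists than a flat inverted index with the same number of learned centroids.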
We propose a novel approach to solving the approximate k-nearest neighbor search problem in metric spaces. The search structure is based on a navigable small world graph with vertices corresponding to the stored elements, edges corresponding to links between them, and a variation of a greedy algorithm for searching. The navigable small world is created simply by keeping the approximate Delaunay graph links produced at the start of construction. The approach is very general, being defined in terms of arbitrary metric spaces, and at the same time very simple. The algorithm handles insertions in the same way as queries: by finding approximate neighbors for the inserted element and connecting it to them. Both search and insertion can be done in parallel, requiring only local information from the structure, and the structure can be made distributed. The accuracy of the probabilistic k-nearest neighbor queries can be adjusted without rebuilding the structure.
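A minimal sketch of the greedy routing step described above, assuming an adjacency-list graph and Euclidean distance; the full algorithm restarts this routine from several entry points to answer approximate k-NN queries.

```python
import numpy as np

def greedy_search(graph, vectors, query, entry):
    """Greedy routing on a navigable small world graph: starting from
    an entry vertex, repeatedly move to the neighbor closest to the
    query and stop at a local minimum. `graph` maps a vertex id to
    the list of its adjacent vertex ids; `vectors` holds the stored
    elements."""
    current = entry
    d_cur = np.linalg.norm(vectors[current] - query)
    while True:
        nxt = min(graph[current],
                  key=lambda v: np.linalg.norm(vectors[v] - query))
        d_nxt = np.linalg.norm(vectors[nxt] - query)
        if d_nxt >= d_cur:        # local minimum reached: stop here
            return current
        current, d_cur = nxt, d_nxt
```

Insertion reuses the same routine: route to the new element's approximate neighbors and connect it to them, which is why the structure can be built incrementally with only local information.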
Simulations on data in Euclidean spaces show that a structure built with the proposed algorithm has small-world navigation properties, with log2(n) insertion and search complexity at fixed accuracy, and performs well at high dimensionality. A simulation on the CoPHiR dataset revealed high efficiency on large datasets (more than an order of magnitude fewer metric computations at fixed recall) compared to permutation indexes. Only 0.03% of the 10 million 208-dimensional vector dataset needs to be evaluated to achieve 0.999 recall (virtually exact search). At recall 0.93, a processing speed of 2800 queries/s can be achieved on a dual Intel X5675 Xeon server node with a Java implementation.
•A novel structure for the approximate k-nearest neighbor problem is presented.
•The structure is based on a navigable small world graph.
•The structure has weak dependence on dimensionality.
•The structure has shown superior performance in simulations.
Hashing has been widely used for large-scale approximate nearest neighbor search owing to its storage and search efficiency. Recent supervised hashing research has shown that deep learning-based methods can significantly outperform non-deep methods. Most existing supervised deep hashing methods exploit supervisory signals to generate similar and dissimilar image pairs for training. However, natural images can have large intraclass and small interclass variations, which may degrade the accuracy of hash codes. To address this problem, we propose a novel two-stream ConvNet architecture that learns hash codes with class-specific representation centers. Our basic idea is that if we can learn a unified binary representation for each class as a center and encourage the hash codes of images to be close to the corresponding centers, the intraclass variation will be greatly reduced. Accordingly, we design a neural network that leverages label information and outputs a unified binary representation for each class. We also design an image network that learns hash codes from images and forces these hash codes to be close to the corresponding class-specific centers. These two networks are then seamlessly incorporated into a unified, end-to-end trainable framework. Extensive experiments on three popular benchmarks corroborate that our proposed method outperforms current state-of-the-art methods.
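The retrieval side of this center-based scheme can be sketched as follows, assuming ±1 binary codes and centers have already been learned; the two-stream training losses themselves are not reproduced here.

```python
import numpy as np

def retrieve_by_centers(codes, centers, query_code, top=10):
    """Once training has pushed each image's hash code toward its
    class center, a query is matched by Hamming distance. `codes` is
    the (N, L) database of +/-1 codes, `centers` the (C, L) class
    centers, and `query_code` a length-L +/-1 vector."""
    # Hamming distance between +/-1 codes via the inner product:
    # d_H(a, b) = (L - a.b) / 2 for length-L codes
    L = codes.shape[1]
    d_db = (L - codes @ query_code) / 2       # distances to database codes
    d_ctr = (L - centers @ query_code) / 2    # distances to class centers
    predicted_class = int(np.argmin(d_ctr))   # nearest center
    return predicted_class, np.argsort(d_db)[:top]
```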
•A new SNN-based definition of ρ and δ for center selection is presented.
•The definition makes the centers of low-density clusters more distinguishable.
•The new two-step allocation method does not further magnify misclassification.
•The method is robust in dealing with non-convex or cross-winding clusters.
•The method is robust to intermediate levels of noise in datasets.
Clustering by fast search and find of density peaks (DPC) is a clustering method that was reported in Science in June 2014. The algorithm is based on the assumption that cluster centers have high local densities and are generally far from each other; with a decision graph, cluster centers can be easily located. However, this approach suffers from certain disadvantages. First, the definitions of the local density and the distance measurement are too simple, so DPC may perform poorly on complex datasets that are multi-scale, cross-winding, of various densities, or high-dimensional. Second, the one-step allocation strategy is not robust and has poor fault tolerance: if a point is assigned incorrectly, the subsequent allocation further amplifies the error, which has a severe negative impact on the clustering results. Third, the cutoff distance dc is generally difficult to determine, since the range of each attribute is usually unknown; even after normalization or with the relative-percentage method, a small change in dc still causes a conspicuous fluctuation in the result, especially on real-world datasets. Considering these drawbacks, we propose a shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNN-DPC) algorithm. We present three new definitions: SNN similarity, local density ρ, and distance δ from the nearest point of higher density. These definitions take the information of the nearest neighbors and the shared neighbors into account, and they self-adapt to the local surroundings. We then introduce a two-step allocation method comprising inevitably subordinate and possibly subordinate assignments. The former quickly and accurately recognizes and allocates the points that certainly belong to one cluster by counting the number of shared neighbors between two points; the latter assigns the remaining points by finding the clusters to which more of their neighbors belong. The algorithm is benchmarked on publicly available synthetic datasets, UCI real-world datasets, and the Olivetti Faces dataset, which are often used to test the performance of clustering algorithms. We compared the results with those of DPC, fuzzy weighted k-nearest neighbors density peak clustering (FKNN-DPC), affinity propagation (AP), ordering points to identify the clustering structure (OPTICS), density-based spatial clustering of applications with noise (DBSCAN), and K-means, using adjusted mutual information (AMI), adjusted Rand index (ARI), and the Fowlkes–Mallows index (FMI) as metrics. The experimental results show that our method can recognize clusters regardless of their size, shape, and dimensionality; is robust to noise; and is remarkably superior to DPC, FKNN-DPC, AP, OPTICS, DBSCAN, and K-means.
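A simplified sketch of the shared-nearest-neighbor similarity that underlies SNN-DPC appears below. The paper's full definitions of SNN similarity, ρ, and δ also incorporate the distances to the shared neighbors and a mutual-neighborhood condition, which are omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=10):
    """Simplified shared-nearest-neighbor similarity: two points are
    similar in proportion to how many k-nearest neighbors they share.
    A simple local density rho is derived as each point's total
    similarity to its own neighbors, so it self-adapts to the local
    surroundings instead of depending on a global cutoff distance."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)              # neighborhoods include the point itself
    neigh = [set(row) for row in idx]
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(neigh[i] & neigh[j])
    rho = np.array([S[i, idx[i]].sum() for i in range(n)])
    return S, rho
```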