It is well known that active learning can simultaneously improve the quality of the classification model and decrease the complexity of training instances. However, several previous studies have indicated that the performance of active learning is easily disrupted by an imbalanced data distribution. Some existing imbalanced active learning approaches also suffer from either low performance or high time consumption. To address these problems, this paper describes an efficient solution based on the extreme learning machine (ELM) classification model, called active online-weighted ELM (AOW-ELM). The main contributions of this paper include: 1) the reasons why active learning can be disrupted by an imbalanced instance distribution, together with the influencing factors, are discussed in detail; 2) the hierarchical clustering technique is adopted to select the initially labeled instances in order to avoid the missed cluster effect and the cold start phenomenon as much as possible; 3) the weighted ELM (WELM) is selected as the base classifier to guarantee the impartiality of instance selection during active learning, and an efficient online update mode of WELM is theoretically deduced; and 4) an early stopping criterion that is similar to but more flexible than the margin exhaustion criterion is presented. The experimental results on 32 binary-class data sets with different imbalance ratios demonstrate that the proposed AOW-ELM algorithm is more effective and efficient than several state-of-the-art active learning algorithms that are specifically designed for the class imbalance scenario.
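The hierarchical-clustering seeding step in contribution 2) can be sketched as follows: cluster the pool, then label one representative per cluster so no cluster is missed at the start. This is a minimal single-linkage sketch; the linkage choice, the toy 2-D points and the `agglomerative_seeds` name are illustrative assumptions, and the paper's exact seeding scheme may differ.

```python
from math import dist

def agglomerative_seeds(X, k):
    """Single-linkage agglomerative clustering down to k clusters, then pick
    one representative per cluster as the initially labeled instances."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        # find the closest pair of clusters under single linkage
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair
        del clusters[b]
    return [c[0] for c in clusters]  # one seed per surviving cluster

# two well-separated toy groups; each contributes one seed
seeds = agglomerative_seeds([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], 2)
```

Seeding one instance per cluster is what guards against the missed cluster effect: a purely random initial sample could draw all its labels from a single dense region.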
The scale of the radius used to construct the neighborhood relation has a great effect on the results of neighborhood rough sets and the corresponding measures. A very small radius frequently brings us nothing, because any two different samples are separated from each other even when they share the same label. As the radius grows, there is a serious risk that samples with different labels may fall into the same neighborhood. Obviously, the radius-based neighborhood relation does not take the labels of samples into account, which leads to unsatisfactory discrimination. To fill such a gap, a pseudo-label strategy is systematically studied in rough set theory. Firstly, a pseudo-label neighborhood relation is proposed. Such a relation can differentiate samples not only by distance but also by the pseudo labels of the samples. Therefore, both the neighborhood rough set and some corresponding measures can be re-defined. Secondly, attribute reductions are explored based on the re-defined measures. A heuristic algorithm is also designed to compute reducts. Finally, the experimental results over UCI data sets indicate that our pseudo-label strategy is superior to the traditional neighborhood approach, mainly because the former can significantly reduce the uncertainties and improve the classification accuracies. The Wilcoxon signed rank test results also show that the neighborhood approach and the pseudo-label neighborhood approach differ significantly from the viewpoints of the measures and attribute reductions in rough set theory.
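The pseudo-label neighborhood relation described above can be sketched directly: a sample's neighbors must both lie within the radius and share its pseudo label. The function name, the toy data and the assumption that pseudo labels come from some unsupervised pre-labelling step are illustrative, not the paper's exact formulation.

```python
from math import dist

def pl_neighborhood(X, pseudo, i, radius):
    """Pseudo-label neighborhood of sample i: indices j whose Euclidean
    distance to i is within `radius` AND whose pseudo label matches i's."""
    return {j for j in range(len(X))
            if dist(X[i], X[j]) <= radius and pseudo[j] == pseudo[i]}

# toy data: sample 2 is close to sample 0 but carries a different pseudo label,
# so the plain radius relation would admit it while the pseudo-label one does not
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
pseudo = [0, 0, 1, 1]
nbrs = pl_neighborhood(X, pseudo, 0, radius=0.3)
```

This is exactly the failure mode the abstract names: with radius 0.3 alone, samples 0–2 would be mutual neighbors despite mixed labels; the pseudo-label filter restores the discrimination.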
•Multi-granularity is considered in attribute reduction for improving classification performance.•Multi-granularity attribute selector is proposed to accelerate the searching of reduct.•Our proposed selector is efficient and effective.
Presently, the mechanism of multi-granularity has been frequently realized through various mathematical tools in Granular Computing, especially rough set. Nevertheless, as a key topic of rough set, attribute reduction has rarely been exploited from the viewpoint of multi-granularity. To fill such a gap, Multi-Granularity Attribute Reduction is defined to characterize a reduct that satisfies the intended multi-granularity constraint instead of one and only one granularity-based constraint. Furthermore, to accelerate the searching process of reduct, a Multi-Granularity Attribute Selector is introduced into the framework of the heuristic algorithm. Its key procedure is twofold: (1) fuse all the granularity-based measure values to construct the multi-granularity constraint; (2) integrate the suitable granularity-based measure values to evaluate the candidate attributes. Based on the multi-granularity structure formed by the neighborhood rough set, the experimental results over 20 UCI data sets demonstrate that, compared with single-granularity attribute reduction, our selector can not only generate reducts that do not degrade classification performance, but also significantly reduce the elapsed time of computing reducts. This research suggests a new trend of attribute reduction in the multi-granularity environment.
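The heuristic framework referred to above follows the usual forward greedy skeleton: repeatedly add the attribute that most improves an evaluation measure until the reduct constraint is met. The sketch below assumes a generic `measure` callable; in the paper's setting, the fused multi-granularity measure would be plugged in here, so treat the names and the toy measure as assumptions.

```python
def greedy_reduct(attributes, measure, threshold):
    """Forward greedy reduct search: grow the reduct one attribute at a time,
    always taking the attribute that maximizes `measure` of the enlarged set,
    until measure(reduct) reaches `threshold` (the reduct constraint)."""
    reduct = set()
    remaining = set(attributes)
    while remaining and measure(reduct) < threshold:
        best = max(remaining, key=lambda a: measure(reduct | {a}))
        reduct.add(best)
        remaining.remove(best)
    return reduct

# toy measure for illustration only: "importance" of a set is sum(set)/6,
# so attribute 3 alone already satisfies a threshold of 0.5
found = greedy_reduct([1, 2, 3], lambda s: sum(s) / 6, 0.5)
```

The acceleration claimed for the multi-granularity selector lives inside `measure`: evaluating candidates with only the suitable granularities avoids recomputing every granularity-based value at each greedy step.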
The traditional k-means, which unambiguously assigns an object precisely to a single cluster with a crisp boundary, does not adequately reflect the fact that a cluster may not have a well-defined boundary. This paper presents a three-way k-means clustering algorithm based on the three-way strategy. In the proposed method, an overlap clustering is used to obtain the supports (unions of the core regions and the fringe regions) of the clusters, and perturbation analysis is applied to separate the core regions from the supports. The difference between the support and the core region is regarded as the fringe region of the specific cluster. Therefore, a three-way explanation of the cluster is naturally formed. The Davies–Bouldin index (DB), Average Silhouette index (AS) and Accuracy (ACC) are computed using the core regions to evaluate the structure of the three-way k-means result. The experimental results on UCI data sets and USPS data sets show that such a strategy is effective in improving the structure of clustering results.
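The support/core decomposition above can be made concrete with a tiny sketch: given a cluster's support and core (as sets of sample ids), the fringe is their difference and the rest of the universe is the "not belong-to" region. How support and core are actually obtained (overlap clustering plus perturbation analysis) is specific to the paper and not reproduced here.

```python
def three_way(universe, support, core):
    """Three-way view of one cluster: core (belong-to fully),
    fringe = support - core (belong-to partially),
    outside = universe - support (not belong-to)."""
    fringe = support - core
    outside = universe - support
    return core, fringe, outside

# toy universe of 6 samples; samples 2 and 3 sit on the cluster boundary
core, fringe, outside = three_way(set(range(6)), {0, 1, 2, 3}, {0, 1})
```

The three returned sets are pairwise disjoint and jointly cover the universe, which is exactly the three-way partition the abstract describes.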
Two-way clustering algorithms use a single set to represent a cluster, which cannot intuitively show the fringe objects of a cluster. Three-way clustering uses the core region and the fringe region to describe a cluster, dividing the universe into three disjoint sets that capture the three types of relationships between a cluster and a sample, namely, belong-to fully, belong-to partially and not belong-to fully. One of the main problems of three-way clustering is to construct the core and the fringe of each cluster. In this paper, we propose a three-way clustering algorithm that uses the stability of each sample. In the proposed algorithm, we use a set of base clustering results as inputs and obtain the samples' stability by means of the co-association matrix and a determinacy function. With this stability, the universe is divided into the universe core and the universe fringe according to a threshold on the sample's stability. The universe core is constituted by the samples with high stability and is divided into the core region of each cluster by the k-means algorithm, whereas the universe fringe is constituted by the samples with low stability and is assigned to the fringe region of each cluster. Therefore, a three-way explanation of the cluster is naturally formed. The experimental results on UCI data sets show that the proposed algorithm is effective in revealing cluster structures.
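The co-association step can be sketched as follows: count, over the base clusterings, how often each pair of samples lands in the same cluster, then score a sample as stable when its co-association values are close to 0 or 1 (unambiguous). The particular determinacy formula below is one plausible choice for illustration, not necessarily the paper's.

```python
def co_association(base_clusterings):
    """Co-association matrix: C[i][j] is the fraction of base clusterings
    in which samples i and j share a cluster."""
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    C = [[0.0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C

def stability(C):
    """Assumed determinacy: average of |2*C[i][j] - 1| over partners j,
    so values near 0.5 (ambiguous pairings) pull stability down."""
    n = len(C)
    return [sum(abs(2 * C[i][j] - 1) for j in range(n) if j != i) / (n - 1)
            for i in range(n)]

# sample 1 flips between clusters across the base results, so it should
# score lowest and fall into the universe fringe
base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
s = stability(co_association(base))
```

Thresholding `s` then yields the universe core (high stability, later split by k-means into core regions) and the universe fringe (low stability, assigned to fringe regions).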
•The reason for the damage caused by class imbalance for ELM is analyzed in theory.•The influence factors on the performance of ELM on skewed data are investigated.•An optimal decision outputs compensation-based ELM called ODOC-ELM is presented.•Exploring prior data distributions helps improve the quality of ELM.•Statistical results indicate the superiority of the proposed ODOC-ELM algorithm.
Extreme learning machine (ELM) has been a widely used learning paradigm for training single hidden layer feedforward networks (SLFNs). However, like many other classification algorithms, ELM may learn undesirable class boundaries from data with unbalanced classes. This paper first analyzes why class imbalance damages ELM, and then discusses how several data distribution factors influence that damage. Next, we present an optimal decision outputs compensation strategy to deal with the class imbalance problem in the context of ELM. Specifically, the outputs of the minority classes in ELM are properly compensated. For a binary-class problem, the compensation can be regarded as a single-variable optimization problem, so the golden section search algorithm is adopted to find the optimal compensation value. For a multi-class problem, the particle swarm optimization (PSO) algorithm is used to solve the multivariate optimization problem and to provide the optimal combination of compensations. Experimental results on a large number of imbalanced data sets demonstrate the superiority of the proposed algorithm. Statistical results indicate that the proposed approach not only outperforms the original ELM, but also yields better or at least competitive results compared with several widely used and state-of-the-art class imbalance learning methods.
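The single-variable search in the binary case can be illustrated with a standard golden-section routine. The surrogate objective below (a quadratic with a known minimum) merely stands in for the real criterion, which in ODOC-ELM would be a validation-error curve over candidate compensation values and is assumed unimodal.

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden-section search:
    shrink the bracket by the golden ratio each step, keeping the side
    that contains the smaller interior value."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    while abs(b - a) > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# hypothetical surrogate: pretend validation error over the compensation
# value is (x - 0.3)^2, so the optimal compensation is 0.3
best = golden_section_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

Golden-section search is a natural fit here because it needs only function evaluations (no gradients), which matches a criterion defined by retraining or re-scoring a classifier.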
Clustering plays an important role in data mining technology. Traditional clustering algorithms are hard clustering algorithms; that is, objects either belong to a class or do not belong to a class. However, when dealing with uncertain data, such forced division leads to decision-making errors. The three-way k-means clustering algorithm can reasonably divide the data into several groups with uncertain boundaries, but it is still sensitive to the initial clustering centers. To solve this problem, this paper presents a three-way k-means clustering algorithm based on the artificial bee colony, integrating the artificial bee colony algorithm with the three-way k-means clustering algorithm. The fitness function of a honey source is constructed from a class cohesion function and an inter-class dispersion function to guide the bee colony in searching for high-quality honey sources globally. Using the cooperation and exchange of different roles between bee colonies, the data set is clustered repeatedly to find the optimal honey source location, w
It is well known that in supervised learning, active learning can effectively decrease the complexity of training instances without an obvious loss of classification performance. Generally, active learning is applied in scenarios where many instances are easy to acquire, but labeling them is expensive and/or time-consuming. In this study, we implement active learning with the extreme learning machine (ELM) classifier for three reasons: (1) ELM has light computational costs, (2) ELM has strong generalization ability, even comparable with the support vector machine (SVM), and (3) ELM can be directly applied to both binary-class and multiclass problems. Specifically, an active learning algorithm based on the ELM classifier, named AL-ELM, is proposed in this paper. During active learning, AL-ELM estimates the uncertainty of each unlabeled instance by creating a mapping between the actual outputs of the instance in ELM and the approximated membership probability of the same instance; in other words, ELM is converted into an equivalent Bayes classifier. On each iteration, the most uncertain instances are extracted and labeled to improve the quality of the classification model. The learning procedure stops when it satisfies a pre-designed criterion. Experimental results on 20 benchmark data sets show that AL-ELM is better than or at least comparable to several state-of-the-art uncertainty-based active learning algorithms. Also, in contrast with several other algorithms, AL-ELM can effectively decrease the running time of the learning procedure.
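The output-to-probability mapping and the selection step can be sketched for the binary case: squash each raw decision output through a sigmoid as an approximate membership probability, then pick the instances closest to 0.5. The sigmoid mapping and the names here are assumptions standing in for the paper's exact output-to-probability relation.

```python
import math

def sigmoid(z):
    """Map a raw decision output to an approximate posterior in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def most_uncertain(outputs, k=2):
    """Uncertainty sampling for the binary case: return the indices of the
    k unlabeled instances whose approximate posterior is closest to 0.5,
    i.e. the ones nearest the decision boundary."""
    probs = [sigmoid(o) for o in outputs]
    ranked = sorted(range(len(outputs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# raw ELM outputs of five hypothetical unlabeled instances; indices 2 and 1
# sit nearest the boundary and would be sent to the oracle for labeling
picked = most_uncertain([2.1, -0.1, 0.05, -1.7, 0.9], k=2)
```

Each active-learning iteration would label the picked instances, retrain, and repeat until the pre-designed stopping criterion fires.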
Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Correctly locating DNA-binding residues solely from protein sequences is an important but challenging task for protein function annotation and drug discovery, especially in the post-genomic era where large volumes of protein sequences have quickly accumulated. In this study, we report a new predictor, named TargetDNA, for targeting protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights for a weighted combination of the two features. Based on the combined feature, multiple initial predictors with SVMs as classifiers are trained by applying a random under-sampling technique to the original dataset, the purpose of which is to cope with the severe imbalance between the numbers of DNA-binding and non-binding residues. The final ensemble predictor is obtained by boosting the multiple initially trained predictors. Experimental results demonstrate that the proposed TargetDNA achieves high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.
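The balancing step behind each base predictor can be sketched as plain random under-sampling: keep every minority-class (binding) residue and draw an equally sized random subset of the majority class. The feature vectors, labels and function name below are placeholders; the SVM training and boosting stages are outside this sketch.

```python
import random

def under_sample(X, y, minority=1, seed=0):
    """Down-sample the majority class so both classes have equal counts.
    Returns a balanced (X, y) subset; a fresh seed per base predictor
    yields the diverse training sets the ensemble is built from."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == minority]
    neg = [i for i, yi in enumerate(y) if yi != minority]
    keep = pos + rng.sample(neg, len(pos))  # all minority + matched majority
    return [X[i] for i in keep], [y[i] for i in keep]

# toy residue set: 10 non-binding (0) vs 3 binding (1) residues
X = list(range(13))
y = [0] * 10 + [1] * 3
Xb, yb = under_sample(X, y)
```

Repeating this with different seeds and training one classifier per balanced subset is the standard way an under-sampling ensemble copes with severe class imbalance without discarding any minority examples.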
Calculating single-source shortest paths (SSSPs) rapidly and precisely from weighted digraphs is a crucial problem in graph theory. As a mathematical model for processing uncertain tasks, rough set theory (RST) has been proven capable of investigating graph theory problems. Recently, some efficient RST approaches for discovering different subgraphs (e.g. strongly connected components) have been presented. This work is devoted to discovering SSSPs of weighted digraphs with the aid of RST. First, the SSSP problem is probed through RST, which provides the theoretical foundation for calculating SSSPs from weighted digraphs with an RST approach. Second, a heuristic search strategy is designed: the weights of edges serve as heuristic information to optimize the search of the $ k $-step $ R $-related set, an RST operator, so that some invalid searches can be avoided and the efficiency of discovering SSSPs is improved. Finally, the W3SP@R algorithm based on RST is presented to calculate SSSPs of weighted digraphs. Related experiments were implemented to verify the W3SP@R algorithm. The results show that W3SP@R can precisely calculate SSSPs with competitive efficiency.
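For reference, the distances W3SP@R is claimed to compute are the same ones produced by the classical Dijkstra algorithm; the sketch below is that standard baseline, not the paper's RST search, and the toy graph is an assumption.

```python
import heapq

def dijkstra(graph, source):
    """Classical SSSP on a digraph with non-negative weights.
    graph: {u: [(v, w), ...]} adjacency lists; returns {node: distance}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# toy weighted digraph: the a->b->c route (cost 3) beats the direct a->c edge
g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
d = dijkstra(g, "a")
```

Any RST-based SSSP method can be sanity-checked against this baseline, since both must agree on the exact shortest distances.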