Research Highlights
•We present a new distributed MapReduce prototype reduction method.
•This approach features a linear time complexity.
•The prototypes are built using fuzzy rules.
•The number of Mappers/Reducers does not affect the accuracy of the model.
•The main advantages are flexibility and simplicity.
The growing amount of available data has become a serious challenge for data mining and machine learning techniques. Well-known classification methods that have been widely applied so far are no longer feasible in Big Data environments. For this reason, prototype reduction techniques (both selection and generation) emerge as a candidate solution: they build a reduced version of the dataset that speeds up the execution of algorithms such as k-Nearest Neighbors and alleviates their memory constraints. However, these techniques generally have a quadratic O(N²) time complexity and thus run into the same time and memory limitations as the data mining and machine learning algorithms they are meant to support. In order to overcome these limitations, we introduce a new distributed MapReduce prototype generation method, called CHI-PG, that provides a linear O(N) time complexity and ensures constant accuracy regardless of the degree of parallelism. This approach builds prototypes by applying a simple scheme based on the rule generation process of the Chi et al. Fuzzy Rule-Based Classification System and takes advantage of the suitability of this classifier for the MapReduce paradigm. The empirical study shows that our new approach significantly improves on the execution time of a state-of-the-art distributed prototype reduction algorithm (MRPR) without decreasing (and even improving upon) classification accuracy and reduction rates. Moreover, CHI-PG proves to be a candidate solution to the time and memory constraints of k-Nearest Neighbors when tackling large-scale datasets.
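The abstract does not spell out the prototype construction, but the Chi et al. rule generation step assigns each example to the single fuzzy region that best matches it, which maps naturally onto MapReduce. The Python sketch below illustrates that idea under stated assumptions (features scaled to [0, 1], uniform triangular partitions, and one prototype per fuzzy region and class, computed as the mean of its examples); it is an illustration of the general scheme, not the authors' implementation.

    # Minimal sketch of a Chi-style MapReduce prototype generation step.
    # Assumptions (not taken from the paper): features scaled to [0, 1],
    # triangular fuzzy partitions, one prototype per (fuzzy region, class)
    # computed as the mean of the examples that fall in that region.
    from collections import defaultdict

    def region_of(x, n_labels=3):
        """Map each attribute to its strongest fuzzy label (Chi-style antecedent)."""
        # With uniform triangular partitions on [0, 1], the strongest label
        # for a value v is simply the nearest partition peak.
        return tuple(min(int(v * (n_labels - 1) + 0.5), n_labels - 1) for v in x)

    def mapper(chunk):
        """Emit partial sums keyed by (antecedent, class) for one data split."""
        acc = defaultdict(lambda: [None, 0])
        for x, y in chunk:
            key = (region_of(x), y)
            s, n = acc[key]
            acc[key] = ([a + b for a, b in zip(s, x)] if s else list(x), n + 1)
        return acc.items()

    def reducer(pairs):
        """Merge partial sums and output one prototype per (antecedent, class)."""
        acc = defaultdict(lambda: [None, 0])
        for key, (s, n) in pairs:
            t, m = acc[key]
            acc[key] = ([a + b for a, b in zip(t, s)] if t else list(s), m + n)
        return {key: [v / n for v in s] for key, (s, n) in acc.items()}

Because each mapper processes its split independently and the reduce step only merges per-key sums, the cost grows linearly with the number of examples, and the merged prototypes do not depend on how many Mappers/Reducers are used, which is consistent with the constant-accuracy claim.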
Learning from imbalanced datasets is highly demanded in real-world applications and a challenge for standard classifiers, which tend to be biased towards the majority classes. Undersampling approaches reduce the size of the majority class to balance the class distribution. Evolutionary approaches are prominent among them, treating undersampling as a binary optimisation problem that determines which examples are removed. However, their use is limited to small datasets due to the cost of fitness evaluation. This work proposes a two-stage clustering-based surrogate model that enables evolutionary undersampling to compute fitness values faster. The main novelty lies in the development of a surrogate model for binary optimisation that is based on the meaning of the solutions (phenotype) rather than on their binary representation (genotype). We evaluate the approach on 44 imbalanced datasets, showing that, compared with the original evolutionary undersampling, it saves up to 83% of the runtime without significantly deteriorating classification performance.
•EUSC is a clustering-based surrogate model for evolutionary undersampling.
•We developed a surrogate model for binary optimisation based on the phenotype.
•EUSC can save up to 83% of the runtime without significantly reducing performance.
•EUSC is more effective and robust than existing approximations based on windowing.
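The surrogate is described only at a high level here; the following Python sketch shows one way a phenotype-based (cluster-coverage) surrogate could short-cut fitness evaluations: chromosomes that retain a similar share of each majority-class cluster get the same cached fitness. The clustering granularity, signature rounding, and caching policy are illustrative assumptions, not the published design.

    # Hedged sketch of a phenotype-based surrogate for evolutionary undersampling.
    # Assumption (not from the paper): two chromosomes that keep a similar
    # fraction of each majority-class cluster ("phenotype signature") receive
    # the same fitness, so the expensive evaluation runs only for new signatures.
    import numpy as np
    from sklearn.cluster import KMeans

    class PhenotypeSurrogate:
        def __init__(self, X_majority, n_clusters=10, true_fitness=None):
            self.labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_majority)
            self.n_clusters = n_clusters
            self.true_fitness = true_fitness  # expensive evaluation, e.g. train 1NN, compute g-mean
            self.cache = {}

        def signature(self, chromosome):
            """Phenotype: rounded fraction of each cluster retained by the chromosome."""
            keep = np.asarray(chromosome, dtype=bool)
            return tuple(np.round(keep[self.labels == c].mean(), 1)
                         for c in range(self.n_clusters))

        def fitness(self, chromosome):
            sig = self.signature(chromosome)
            if sig not in self.cache:          # evaluate only unseen phenotypes
                self.cache[sig] = self.true_fitness(chromosome)
            return self.cache[sig]

The design choice this illustrates is the one named in the highlights: equivalence of candidate solutions is decided in phenotype space (which data regions remain covered), not by Hamming distance between bit strings.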
The One-vs-One strategy is one of the most commonly used decomposition techniques for multi-class classification problems: the multi-class problem is divided into easier-to-solve binary classification problems, one for each pair of classes of the original problem, which are then learned by independent base classifiers.
The way this division is performed gives rise to the so-called non-competence problem. It occurs whenever an instance is classified, since the instance is submitted to all the base classifiers even though the outputs of some of them are not meaningful (they were not trained using instances from the class of the instance to be classified). This issue may lead to erroneous classifications because, in spite of their incompetence, all classifiers' decisions are usually considered in the aggregation phase.
In this paper, we propose a dynamic classifier selection strategy for the One-vs-One scheme that tries to avoid the non-competent classifiers when their output is probably not of interest. We consider the neighborhood of each instance to decide whether a classifier may be competent or not. In order to verify the validity of the proposed method, we carry out a thorough experimental study considering base classifiers from five Machine Learning paradigms, comparing our proposal with the best-performing state-of-the-art aggregation for each base classifier. The findings drawn from the empirical analysis are supported by the appropriate statistical analysis.
•Non-competent classifiers are one of the problems of the One-vs-One strategy.
•Non-competence cannot be completely eliminated, but it can be reduced.
•A novel dynamic strategy is presented to reduce the effect of non-competent classifiers.
•The neighbors of each instance are taken into account to avoid non-competence.
•The new strategy outperforms the state-of-the-art aggregations.
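As a concrete illustration of the dynamic selection idea described above, the sketch below restricts the One-vs-One aggregation to classifiers involving classes found among the query's nearest neighbors. The competence rule used here (at least one of the pair's classes must appear in the neighborhood) is an illustrative reading of the idea, not necessarily the paper's exact criterion.

    # Hedged sketch of neighborhood-based dynamic selection for One-vs-One.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def dynamic_ovo_predict(x, X_train, y_train, pairwise, classes, k=5):
        """pairwise[(i, j)] is a fitted binary classifier returning label i or j.
        X_train / y_train are numpy arrays; fit the neighbor index once in practice."""
        nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        _, idx = nn.kneighbors([x])
        neighbor_classes = set(y_train[idx[0]])
        votes = {c: 0 for c in classes}
        for (i, j), clf in pairwise.items():
            if i in neighbor_classes or j in neighbor_classes:  # skip non-competent pairs
                votes[clf.predict([x])[0]] += 1
        return max(votes, key=votes.get)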
The class imbalance problem arises in many real-world classification applications in engineering. It is characterised by a very uneven distribution of examples among the classes. The scenario of multiple imbalanced classes is even more restrictive when the aim of the final system is to obtain the most accurate precision for each of the concepts of the problem.
The goal of this work is to provide a thorough experimental analysis that allows us to determine the behaviour of the different approaches proposed in the specialised literature. First, we make use of binarization schemes, i.e., one-versus-one and one-versus-all, in order to apply standard approaches for binary imbalanced problems. Second, we apply several ad hoc procedures that have been specifically designed for the scenario of imbalanced datasets with multiple classes.
This experimental study includes several well-known algorithms from the literature, such as decision trees, support vector machines and instance-based learning, with the intention of obtaining global conclusions across different classification paradigms. The extracted findings are supported by a statistical comparative analysis using more than 20 datasets from the KEEL repository.
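For concreteness, the sketch below shows how a standard binary-level imbalance technique can be plugged into a binarization scheme, here one-versus-all with SMOTE as a representative resampling method; the classifier, resampler, and aggregation rule are illustrative choices, not the study's full protocol.

    # Sketch: one-versus-all binarization with per-problem resampling.
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.tree import DecisionTreeClassifier

    def ova_with_resampling(X, y, classes):
        """X, y are numpy arrays; one rebalanced binary model per class."""
        models = {}
        for c in classes:
            y_bin = (y == c).astype(int)                   # class c vs. the rest
            X_res, y_res = SMOTE().fit_resample(X, y_bin)  # rebalance the binary problem
            models[c] = DecisionTreeClassifier().fit(X_res, y_res)
        return models

    def ova_predict(models, x):
        # Pick the class whose binary model is most confident in the positive label
        scores = {c: m.predict_proba([x])[0][list(m.classes_).index(1)]
                  for c, m in models.items()}
        return max(scores, key=scores.get)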
•We study the suitability of several Multiple Classifier Systems (MCSs) with noisy data.
•The MCSs studied usually outperform their single learners in terms of global performance.
•Heterogeneous MCSs are not more robust than their most robust single learner.
•MCSs' results depend on the classifiers chosen and on the type and level of noise.
Traditional classifier learning algorithms build a single classifier from the training data. Noisy data may deteriorate the performance of this classifier depending on how sensitive the learning method is to data corruption. In the literature, it is widely claimed that building several classifiers from noisy training data and combining their predictions is an interesting way of overcoming the individual problems that noise produces in each classifier. This claim is usually not supported by thorough empirical studies considering problems with different types and levels of noise. Furthermore, in noisy environments, the noise robustness of the methods can be more important than the performance results themselves and must therefore be carefully studied. This paper aims to reach conclusions on these aspects, focusing on the analysis of the behavior, in terms of performance and robustness, of several Multiple Classifier Systems against their individual classifiers when trained with noisy data. To accomplish this study, several classification algorithms of varying noise robustness are chosen and compared with respect to their combination on a large collection of noisy datasets. The results obtained show that the success of Multiple Classifier Systems trained with noisy data depends on the individual classifiers chosen, the decision combination method, and the type and level of noise present in the dataset, but also on the way diversity is created to build the final system. In most cases, they are able to outperform all their single classification algorithms in terms of global performance, although their robustness depends on how diversity is introduced into the Multiple Classifier System.
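The kind of comparison performed in the study can be reproduced in miniature as follows: a single learner and a bagging ensemble of the same learner are trained on data with injected class noise and evaluated on clean test data. The dataset, noise level, and learners below are illustrative choices only.

    # Minimal sketch: single learner vs. a bagging MCS under 20% class noise.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    noisy = y_tr.copy()
    flip = rng.random(len(noisy)) < 0.20           # corrupt 20% of training labels
    noisy[flip] = rng.integers(0, 3, flip.sum())   # random label (may repeat the original)

    single = DecisionTreeClassifier(random_state=0).fit(X_tr, noisy)
    mcs = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0).fit(X_tr, noisy)
    print("single:", single.score(X_te, y_te), "MCS:", mcs.score(X_te, y_te))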
•A consensus method via penalty functions for decision making in ensembles of fuzzy rule-based classification systems is introduced.
•Overlap indices are built using overlap functions.
•A method for constructing confidence and support measures from overlap indices is presented.
•A new fuzzy reasoning method is proposed, considering different overlap indices, which generalizes the classical methods.
•An example of the generation of fuzzy rule-based ensembles and of decision making by consensus via penalty functions is presented.
The aim of this paper is to propose a consensus method via penalty functions for decision making in ensembles of fuzzy rule-based classification systems (FRBCSs). To that end, we first introduce a method based on overlap indices for building confidence and support measures, which are usually used to evaluate the degree of certainty or interest of an association rule. These overlap indices (generalizations of Zadeh's consistency index between two fuzzy sets) are built using overlap functions, a special kind of not necessarily associative aggregation functions proposed for applications related to the overlap problem and/or settings where associativity is not demanded. Then, we introduce a new fuzzy reasoning method (FRM) for FRBCSs, considering different overlap indices, which generalizes the classical methods. By considering several overlap indices and aggregation functions, we generate fuzzy rule-based ensembles that provide different results. For the decision making related to the selection of the best class, we introduce a consensus method for classification based on penalty functions. We also present theoretical results related to the developed methods. Finally, a detailed example of the generation of fuzzy rule-based ensembles based on the proposed approach, and of the decision making by consensus via penalty functions, is presented.
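As a minimal numerical illustration of the ingredients named above, the sketch below computes Zadeh's consistency index and a generalized overlap index built from a simple overlap function (the product) on discrete fuzzy sets; the paper's constructions of confidence and support measures are richer than this.

    # Minimal sketch of overlap functions and overlap indices on discrete fuzzy sets.
    import numpy as np

    def overlap_product(a, b):
        """A simple overlap function: symmetric, increasing, G(a, b) = 0 iff ab = 0."""
        return a * b

    def zadeh_consistency(A, B):
        """Zadeh's consistency index: max_x min(A(x), B(x))."""
        return np.max(np.minimum(A, B))

    def overlap_index(A, B, G=overlap_product):
        """A generalized overlap index built from an overlap function G."""
        return np.max(G(A, B))

    # Membership degrees of two fuzzy sets over the same finite universe
    A = np.array([0.0, 0.3, 0.9, 0.5])
    B = np.array([0.1, 0.8, 0.4, 0.0])
    print(zadeh_consistency(A, B))   # 0.4
    print(overlap_index(A, B))       # 0.36

Note how the product recovers a stricter notion of overlap than the minimum: both indices vanish exactly when the two sets have disjoint supports, which is the defining property of an overlap index.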
The One-vs-One strategy divides the original multi-class problem into as many binary classification problems as there are pairs of classes. Independent base classifiers are then learned to face each problem, and their outputs are combined to predict a single class label. This way, the accuracy of the baseline classifiers without decomposition is usually enhanced, aside from enabling the usage of binary classifiers, e.g., Support Vector Machines, to solve multi-class problems. This paper analyzes the fact that existing aggregations favor easily recognizable classes; hence, the accuracy enhancement mainly comes from higher correct classification rates over these classes. Under other evaluation criteria, the significant improvements of One-vs-One are diminished, revealing a weakness in the presence of difficult classes. Difficult classes can be defined as those obtaining a lower correct classification rate than the other classes in the problem. After studying the problem of difficult classes in this framework, and aiming to empower these classes, a novel similarity-based aggregation is presented, which generalizes the well-known weighted voting. The experimental analysis shows that the new methodology is able to increase the recognition of difficult classes, obtaining a more balanced performance over all classes, which is a desirable behavior. The methodology is tested within several Machine Learning paradigms and compared with state-of-the-art aggregations for the One-vs-One strategy. The results are supported by the appropriate statistical tests, as suggested in the literature.
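The sketch below gives one illustrative form of similarity-based aggregation: the confidence of each pairwise classifier is re-weighted by the relative similarity between the query and the two classes involved, estimated from distances to each class's nearest training examples. This is an interpretation of the general idea, not the paper's exact formulation; with equal similarities it reduces to standard weighted voting.

    # Hedged sketch of similarity-weighted voting for One-vs-One.
    import numpy as np

    def class_similarity(x, X_train, y_train, c, k=3):
        """Similarity of query x to class c from its k nearest class-c examples."""
        d = np.sort(np.linalg.norm(X_train[y_train == c] - x, axis=1))[:k]
        return 1.0 / (1.0 + d.mean())          # closer neighbors => higher similarity

    def similarity_weighted_vote(x, R, X_train, y_train, classes):
        """R[i][j] in [0, 1] is the confidence of classifier (i, j) for class i."""
        sim = {c: class_similarity(x, X_train, y_train, c) for c in classes}
        scores = {c: 0.0 for c in classes}
        for i in classes:
            for j in classes:
                if i != j:
                    scores[i] += R[i][j] * sim[i] / (sim[i] + sim[j])
        return max(scores, key=scores.get)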
The stereo matching problem attempts to find corresponding locations between pairs of displaced images of the same scene. Correspondence estimation between pixels suffers from occlusions, noise, and bias. This paper introduces a novel approach that represents images by means of interval-valued fuzzy sets, which make it possible to handle the uncertainty caused by the aforementioned problems. The aim is to take advantage of this new representation to develop a stereo matching algorithm. The interval-valued fuzzification process for images proposed here is based on image segmentation. Interval-valued fuzzy similarities are introduced to compare windows whose pixels are represented by intervals. To exploit color information, the similarities of the RGB channels are aggregated using the luminance formula. The experimental analysis compares the proposal with other methods. The new representation, together with the new similarity measure, shows a better overall behavior, providing more accurate correspondences, mainly near depth discontinuities and for images with a large amount of color.
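The following sketch illustrates the matching ingredients described above: pixels carry [lower, upper] membership intervals, window similarity averages a simple interval similarity, and the three RGB channel similarities are aggregated with the luminance weights (Y = 0.299R + 0.587G + 0.114B). The interval construction and the similarity measure in the paper come from image segmentation and are more elaborate than this distance-based stand-in.

    # Illustrative sketch of interval-valued window matching with luminance aggregation.
    import numpy as np

    def interval_similarity(a, b):
        """a, b: (..., 2) arrays of [lower, upper] membership intervals per pixel."""
        return 1.0 - np.abs(a - b).max(axis=-1)      # 1 when the intervals coincide

    def window_similarity(w_left, w_right):
        """Mean interval similarity over all pixels of two candidate windows."""
        return interval_similarity(w_left, w_right).mean()

    def rgb_window_similarity(left_rgb, right_rgb):
        """Aggregate per-channel similarities with the luminance weights.
        left_rgb / right_rgb: arrays of shape (3, H, W, 2), one slice per channel."""
        weights = (0.299, 0.587, 0.114)               # Y = 0.299R + 0.587G + 0.114B
        return sum(w * window_similarity(left_rgb[c], right_rgb[c])
                   for c, w in enumerate(weights))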
Classification of Bitcoin entities is an important task for helping Law Enforcement Agencies reduce anonymity in the Bitcoin blockchain network and detect the classes most tied to illegal activities. However, this task is strongly conditioned by a severe class imbalance in Bitcoin datasets. Existing approaches for addressing the class imbalance problem can be improved by considering generative adversarial networks (GANs), which can boost data diversity. However, GANs are mainly applied to computer vision and natural language processing tasks, not to Bitcoin entity behaviour classification, where they may be useful for learning and generating synthetic behaviours. Therefore, in this work, we present a novel approach to addressing the class imbalance in Bitcoin entity classification by applying GANs. In particular, three GAN architectures were implemented and compared in order to find the most suitable architecture for generating Bitcoin entity behaviours. More specifically, GANs were used to address the Bitcoin imbalance problem by generating synthetic data for the less represented classes before training the final entity classifier. The results were used to evaluate the capabilities of the different GAN architectures in terms of training time, performance, repeatability, and computational cost. Finally, the results achieved by the proposed GAN-based resampling were compared with those obtained using five well-known data-level preprocessing techniques. Models trained on data resampled with our GAN-based approach achieved the highest accuracy improvements and were among the best in terms of precision, recall, and F1-score. Together with Random Oversampling (ROS), GANs proved to be strong contenders for addressing Bitcoin class imbalance and, consequently, for reducing Bitcoin entity anonymity, both in overall and in per-class classification performance. To the best of our knowledge, this is the first work to explore the advantages and limitations of GANs in generating specific Bitcoin data and “attacking” Bitcoin anonymity. The proposed methods ultimately demonstrate that, in Bitcoin applications, GANs are indeed able to learn the data distribution and generate new samples starting from a very limited class representation, which leads to better detection of the classes related to illegal activities.
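As a sketch of the "generate, then train" pipeline described above, the code below trains a deliberately minimal vanilla GAN (not one of the three architectures compared in the paper) on the minority-class rows of an already scaled tabular dataset and samples synthetic rows from its generator; all layer sizes and hyperparameters are illustrative.

    # Hedged sketch of GAN-based minority-class oversampling for tabular data.
    import torch
    import torch.nn as nn

    def make_mlp(sizes, out_act=None):
        layers = []
        for a, b in zip(sizes, sizes[1:]):
            layers += [nn.Linear(a, b), nn.ReLU()]
        layers = layers[:-1] + ([out_act] if out_act else [])  # drop the last ReLU
        return nn.Sequential(*layers)

    def gan_oversample(X_min, n_new, z_dim=16, epochs=500, lr=1e-3):
        """X_min: float tensor of minority-class feature rows (already scaled)."""
        d = X_min.shape[1]
        G = make_mlp([z_dim, 64, 64, d])                       # noise -> synthetic row
        D = make_mlp([d, 64, 64, 1], out_act=nn.Sigmoid())     # row -> P(real)
        opt_g = torch.optim.Adam(G.parameters(), lr=lr)
        opt_d = torch.optim.Adam(D.parameters(), lr=lr)
        bce = nn.BCELoss()
        for _ in range(epochs):
            fake = G(torch.randn(len(X_min), z_dim))
            # Discriminator step: real rows -> 1, generated rows -> 0
            loss_d = bce(D(X_min), torch.ones(len(X_min), 1)) + \
                     bce(D(fake.detach()), torch.zeros(len(X_min), 1))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # Generator step: try to fool the discriminator
            loss_g = bce(D(G(torch.randn(len(X_min), z_dim))), torch.ones(len(X_min), 1))
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        with torch.no_grad():
            return G(torch.randn(n_new, z_dim))                # synthetic minority rows

The synthetic rows would then be concatenated with the original training set before fitting the entity classifier, mirroring the resample-then-train pipeline described above.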