•Noisy and borderline examples in imbalanced datasets harm classifier performance.
•Our proposal reduces noise and makes class boundaries more regular.
•Our proposal performs better than other re-sampling methods in this scenario.
•Ensemble-based noise filters work well when dealing with noise.
•Iterative noise elimination is a good approach for dealing with noisy datasets.
Classification datasets often have an unequal class distribution among their examples. This problem is known as imbalanced classification. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the best-known data pre-processing methods for coping with it and for balancing the different numbers of examples in each class. However, as recent works claim, class imbalance is not a problem in itself; performance degradation is also associated with other factors related to the distribution of the data. One of these is the presence of noisy and borderline examples, the latter lying in the areas surrounding class boundaries. Certain intrinsic limitations of SMOTE can aggravate the problem produced by these types of examples, and current generalizations of SMOTE are not correctly adapted to their treatment.
This paper proposes extending SMOTE with a new element, an iterative ensemble-based noise filter called the Iterative-Partitioning Filter (IPF), which can overcome the problems produced by noisy and borderline examples in imbalanced datasets. The resulting method is SMOTE–IPF. The properties of this proposal are discussed in a comprehensive experimental study, in which it is compared against basic SMOTE and its best-known generalizations. The experiments are carried out both on synthetic datasets with different levels of noise and shapes of borderline examples and on real-world datasets. Furthermore, the impact of introducing additional types and levels of noise into these real-world data is studied. The results show that the new proposal performs better than existing SMOTE generalizations in all these scenarios. The analysis of these results also helps to identify the characteristics of IPF that differentiate it from other filtering approaches.
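For readers unfamiliar with the oversampling step that SMOTE–IPF builds on, the core interpolation mechanism of SMOTE can be sketched in a few lines of NumPy. This is a minimal illustration only: the helper name `smote`, its parameters, and the neighbor count `k` are ours, and the actual SMOTE–IPF method additionally applies the IPF ensemble filter after oversampling, which is not shown here.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority examples by interpolating each
    sampled point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest neighbors per point
    base = rng.integers(0, len(X), n_synthetic)   # which point to expand
    neighbor = nn[base, rng.integers(0, k, n_synthetic)]
    gap = rng.random((n_synthetic, 1))            # interpolation factor in [0, 1)
    # Each synthetic point lies on the segment between a point and a neighbor.
    return X[base] + gap * (X[neighbor] - X[base])
```

Because every synthetic point is a convex combination of two minority examples, noisy minority points propagate their noise into the synthetic data, which is precisely the problem the IPF filtering stage is meant to correct.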
Training classifiers with datasets that suffer from imbalanced class distributions is an important problem in data mining. This issue occurs when the number of examples representing the class of interest is much lower than that of the other classes. Its presence in many real-world applications has attracted growing attention from researchers.
We briefly review the many issues this problem raises in machine learning and its applications, introducing the characteristics of the imbalanced dataset scenario in classification, presenting the specific metrics for evaluating performance in class-imbalanced learning, and enumerating the proposed solutions. In particular, we describe preprocessing, cost-sensitive learning and ensemble techniques, carrying out an experimental study to contrast these approaches in an intra- and inter-family comparison.
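One widely used example of the class-imbalance-aware metrics mentioned above is the geometric mean of per-class recalls (G-mean). The small sketch below (the function name and its exact form are illustrative, not taken from the survey) shows why plain accuracy is misleading under imbalance:

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of per-class recalls: unlike plain accuracy,
    a majority-class predictor cannot inflate it."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# A classifier that always predicts the majority class on a 9:1 sample
# reaches 0.9 accuracy but a G-mean of 0.0, exposing its uselessness
# on the minority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
```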
We then carry out a thorough discussion of the main issues related to data-intrinsic characteristics in this classification problem. This helps to improve current models with respect to: the presence of small disjuncts, the lack of density in the training data, the overlapping between classes, the identification of noisy data, the significance of borderline instances, and dataset shift between the training and test distributions. Finally, we introduce several approaches and recommendations for addressing these problems in conjunction with imbalanced data, and we show some experimental examples of the behavior of learning algorithms on data with such intrinsic characteristics.
We propose a Bayesian physics-informed neural network (B-PINN) to solve both forward and inverse nonlinear problems described by partial differential equations (PDEs) and noisy data. In this Bayesian framework, a Bayesian neural network (BNN) combined with a PINN for the PDE serves as the prior, while Hamiltonian Monte Carlo (HMC) or variational inference (VI) serves as an estimator of the posterior. B-PINNs make use of both physical laws and scattered noisy measurements to provide predictions and to quantify the aleatoric uncertainty arising from the noisy data. Compared with PINNs, in addition to uncertainty quantification, B-PINNs obtain more accurate predictions in scenarios with large noise thanks to their ability to avoid overfitting. We conduct a systematic comparison between the two approaches to B-PINN posterior estimation (i.e., HMC and VI), along with dropout used for quantifying uncertainty in deep neural networks. Our experiments show that HMC is more suitable than VI with a mean-field Gaussian approximation for B-PINN posterior estimation, while dropout employed in PINNs can hardly provide accurate predictions with reasonable uncertainty. Finally, we replace the BNN in the prior with a truncated Karhunen–Loève (KL) expansion, combined with HMC or a deep normalizing flow (DNF) model as the posterior estimator. The KL expansion is as accurate as the BNN and much faster, but unlike the BNN-based framework it cannot easily be extended to high-dimensional problems.
Deep-learning-based object detection methods have achieved promising performance in controlled environments. However, these methods lack sufficient capability to handle underwater object detection due to two challenges: (1) images in underwater datasets and real applications are blurry and accompanied by severe noise that confuses the detectors, and (2) objects in real applications are usually small. In this paper, we propose a Sample-WeIghted hyPEr Network (SWIPENET) and a novel training paradigm named Curriculum Multi-Class Adaboost (CMA) to address these two problems at the same time. First, the backbone of SWIPENET produces multiple high-resolution, semantically rich Hyper Feature Maps, which significantly improve small-object detection. Second, inspired by the human education process that drives learning from easy to hard concepts, we propose the noise-robust CMA training paradigm, which learns from the clean data first and then moves on to the diverse noisy data. Experiments on four underwater object detection datasets show that the proposed SWIPENET+CMA framework achieves better or competitive detection accuracy against several state-of-the-art approaches.
Imbalanced data classification remains a focus of intense research, mostly due to the prevalence of data imbalance in various real-life application domains. A disproportion among objects from different classes may significantly affect the performance of standard classification models. The first problem is high imbalance ratios, which pose a serious learning difficulty and require dedicated methods capable of alleviating it. The second problem is noise, which may accompany the training data and cause strong deterioration of classifier performance or increase training time. Therefore, a desirable classification model should be robust to both skewed data distributions and noise. One of the most popular approaches for handling imbalanced data is oversampling the minority objects in their neighborhood. In this work we criticize this approach and propose a novel strategy for dealing with imbalanced data, with particular focus on the presence of noise. We propose the Radial-Based Oversampling (RBO) method, which finds regions in which synthetic minority objects should be generated on the basis of an imbalance-distribution estimation with radial basis functions. Results of experiments carried out on a representative set of benchmark datasets confirm that the proposed guided synthetic oversampling algorithm offers an interesting alternative to popular state-of-the-art solutions for imbalanced data preprocessing.
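The radial-basis imbalance-distribution estimation behind RBO can be illustrated with a minimal potential function: each minority example contributes a positive Gaussian bump and each majority example a negative one, so the sign of the potential marks regions safe for synthetic generation. This sketch (function name, `gamma`, and the unit weights are our illustrative choices) shows only the potential; the actual RBO generation procedure guided by it is described in the paper.

```python
import numpy as np

def rbf_potential(x, X_min, X_maj, gamma=1.0):
    """Imbalance potential at point x: Gaussian RBF contributions of
    minority examples minus those of majority examples. Positive
    values indicate minority territory, where generating a synthetic
    minority point is safe; negative values indicate majority regions."""
    k = lambda A: np.exp(-gamma * np.sum((np.asarray(A, float) - x) ** 2, axis=1))
    return float(k(X_min).sum() - k(X_maj).sum())
```

Unlike neighborhood-based oversampling such as SMOTE, this density-style criterion avoids placing synthetic points between a minority example and a noisy outlier deep inside the majority region, since the potential there is negative.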
The Deep Operator Network (DeepONet) is a neural network architecture used to approximate operators, including the solution operators of parametric PDEs. DeepONets have shown remarkable approximation ability. However, their performance deteriorates when the training data is polluted with noise, a scenario that often occurs in practice. To handle noisy data, we propose a Bayesian DeepONet based on replica-exchange Langevin diffusion (reLD). Replica exchange uses two particles: the first trains a DeepONet that exploits the loss landscape and makes predictions, while the other trains a DeepONet that explores the loss landscape and escapes local minima via swapping. Compared to DeepONets trained with state-of-the-art gradient-based algorithms (e.g., Adam), the proposed Bayesian DeepONet greatly improves training convergence in noisy scenarios and accurately estimates the uncertainty. To further reduce the high computational cost of reLD training, we propose (1) an accelerated training framework that exploits the DeepONet architecture to reduce its computational cost by up to 25% without compromising performance, and (2) a transfer learning strategy that accelerates training DeepONets for PDEs with different parameter values. Finally, we illustrate the effectiveness of the proposed Bayesian DeepONet on four parametric PDE problems.
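The two-particle exploit/explore mechanism described above can be sketched on a toy loss surface. This is a generic replica-exchange Langevin step, not the paper's DeepONet training loop: the function name, step size, and temperatures are illustrative, and real reLD training would apply this update to network weights with stochastic gradients.

```python
import numpy as np

def reld_step(theta_lo, theta_hi, grad, loss, lr, tau_lo, tau_hi, rng):
    """One replica-exchange Langevin step. The low-temperature particle
    exploits; the high-temperature particle explores; a Metropolis-style
    swap lets exploration rescue exploitation from local minima."""
    def sgld(theta, tau):
        # Gradient step plus temperature-scaled Gaussian noise.
        noise = np.sqrt(2.0 * lr * tau) * rng.standard_normal(theta.shape)
        return theta - lr * grad(theta) + noise
    theta_lo, theta_hi = sgld(theta_lo, tau_lo), sgld(theta_hi, tau_hi)
    # Swap states stochastically; favored when the explorer found a lower loss.
    log_ratio = (1.0 / tau_lo - 1.0 / tau_hi) * (loss(theta_lo) - loss(theta_hi))
    if np.log(rng.random()) < log_ratio:
        theta_lo, theta_hi = theta_hi, theta_lo
    return theta_lo, theta_hi
```

On a convex toy loss the cold particle simply converges; the benefit of the swap shows up on multimodal landscapes, where the hot particle can hop between basins and hand better states to the cold one.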
•A Bayesian framework is developed to solve parametric PDEs using DeepONets.
•The replica exchange SGLD algorithm is used to train the Bayesian DeepONet.
•The proposed Bayesian DeepONet enables training with noisy targets/labels.
•An accelerated method is developed to reduce the training cost up to 25%.
•Transfer learning is used to reduce the cost of training new models.
Community detection and network reconstruction are two major concerns in network analysis. However, these tasks are extremely challenging, since most existing methods are not suitable for noisy, time-evolving data, which are common in real-world situations. To cope with this, we propose a novel method, the group-based binary time-evolving mixture (GBTM) model, to detect communities and recover network structures jointly. This is the first study to address the challenges of community detection and network reconstruction in scenarios where data are dynamic and cannot be directly observed. In this work, a hidden Markov method is employed to capture the temporal evolution of node connections. In addition, we develop a grouped Baum–Welch algorithm for parameter estimation using a forward–backward procedure. Our GBTM model shows that conducting community detection and network reconstruction simultaneously can yield synergistic benefits. Furthermore, we introduce an innovative Bayesian information criterion (BIC) for determining the number of communities. The results of various simulations under different settings and on two real-world networks show that the proposed GBTM model outperforms existing community detection and network reconstruction methods and has great potential for solving time-evolving and noisy network problems.
In engineering problems, there exist many mixed datasets composed of multiple types of data, including data without noise, data with homoscedastic noise (known noise variance), and data with heteroscedastic noise (unknown noise variance). To construct accurate predictive models, it is essential to fully utilize the information provided by such diverse datasets, particularly when training samples are limited. Motivated by this, this paper introduces a novel transfer-learning surrogate modeling approach that fuses multi-type data for prediction. Numerical cases and two engineering applications validate the accuracy and robustness of the proposed model, and comparative analyses reveal superior predictive accuracy and robustness compared to other models. Additionally, by varying the levels of homoscedastic and heteroscedastic noise, the impact of noise on the proposed model is studied. The results indicate that the model has superior applicability and stability compared to other models. The proposed approach is capable of fusing multi-type data and adapting to different levels of noise, making it promising for engineering applications.
Many empirical economists say that the teaching of econometrics is unbalanced, and students are not well-prepared for the serious problems they will encounter with real data. Here, the author considers the problem of noisy data, which is present in most econometric studies, but receives far too little attention. Most econometric studies are done in a world of low signal-to-noise ratios, and educated common sense suggests that we cannot expect precise results in such an environment. Sensitivity analysis shows that the apparent precision of reported econometric results is generally an illusion, because it is highly dependent on error-term independence assumptions.