Multi-class imbalance problems are frequently encountered in real-world applications of machine learning. They have fundamentally complex trade-offs between classes. Existing literature tends to use a predetermined rebalancing strategy and mainly focuses on overall performance measures. However, in many real-world problems, the true level of imbalance and the relative importance between classes are unknown, making it difficult to predetermine the rebalancing strategy and the evaluation criterion. In this paper, we explicitly consider the between-class trade-off issue in the multi-class imbalance problem. We consider all the classes to be important and find a set of optimal trade-offs for the decision-maker to choose from. To reduce the computational cost of this process and make it a practical method, we seek the help of selective ensemble and multiple undersampling rates, and propose the Multi-class Multi-objective Selective Ensemble (MMSE) framework. We further equip the objective modeling with margins to reduce the number of objectives when the task has many classes. Experimental results show that our proposed methods successfully obtain diverse and highly competitive solutions within an acceptable running time.
•We explicitly consider the between-class trade-off issue in the multi-class imbalance problem.
•We design a multi-objective selective ensemble approach to efficiently obtain the optimal trade-off solutions.
•We further propose a margin-based objective modeling to tackle the many-class case, and analyze its optimization ability.
•Our methods successfully obtain diverse and highly competitive solutions within an acceptable running time.
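The notion of a "set of optimal trade-offs" between classes can be illustrated with a small Pareto-front filter. This is only a minimal sketch, not the MMSE procedure itself: it assumes each candidate ensemble is summarised by a tuple of per-class recalls (higher is better), and the function names and the numbers are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every class and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical per-class recalls (class 0, class 1, class 2) of four candidates.
recalls = [
    (0.90, 0.40, 0.55),
    (0.70, 0.65, 0.60),
    (0.68, 0.60, 0.58),  # worse than the second candidate on every class
    (0.50, 0.80, 0.62),
]

front = pareto_front(recalls)
print(front)
```

The surviving candidates are the ones a decision-maker would choose among; none of them can improve one class without giving up performance on another.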
The present research examines the landslide susceptibility in Rudraprayag district of Uttarakhand, India using the conditional probability (CP) statistical technique, the boosted regression tree (BRT) machine learning algorithm, and the CP-BRT ensemble approach to improve the accuracy of the BRT model. The models' outcomes were cross-checked using four folds of the data. The locations of existing landslides were detected by general field surveys and relevant records. A total of 220 previous landslide locations were obtained, presented as an inventory map, and divided into four folds to calibrate and validate the models. For modelling the landslide susceptibility, twelve landslide conditioning factors (LCFs) were used. Two error measures, i.e. the mean absolute error (MAE) and the root mean square error (RMSE), one statistical test, i.e. the Friedman rank test, as well as the receiver operating characteristic (ROC), efficiency and precision were used for validating the produced landslide models. The results of the accuracy measures revealed that all models have good potential to recognize the landslide susceptibility in the Garhwal Himalayan region. Among these models, the ensemble model achieved a higher accuracy (precision: 0.829, efficiency: 0.833, AUC: 89.460, RMSE: 0.069 and MAE: 0.141) than the individual models. According to the outcome of the ensemble simulations, the BRT model's predictive accuracy was enhanced by integrating it with the statistical model (CP). The study showed that areas of fallow land, plantation fields, and roadsides with elevations of more than 1500 m, steep slopes of 24° to 87°, and eroding hills are highly susceptible to landslides. The findings of this work could help in minimizing landslide risk in the Western Himalaya and its adjoining areas with similar landscapes and geological characteristics.
•Landslide susceptibility maps were prepared from twelve landslide conditioning factors using the CP, BRT and CP-BRT models.
•Integrating the BRT model with the CP model increased the level of accuracy.
•Nearly 20% of the study area has a very high probability of landslides.
Probabilistic load forecasts provide comprehensive information about future load uncertainties. In recent years, many methodologies and techniques have been proposed for probabilistic load forecasting. Forecast combination, a widely recognized best practice in point forecasting literature, has never been formally adopted to combine probabilistic load forecasts. This paper proposes a constrained quantile regression averaging (CQRA) method to create an improved ensemble from several individual probabilistic forecasts. We formulate the CQRA parameter estimation problem as a linear program with the objective of minimizing the pinball loss, subject to the constraints that the parameters are nonnegative and sum to one. We demonstrate the effectiveness of the proposed method using two publicly available datasets, the ISO New England data and Irish smart meter data. Compared with the best individual probabilistic forecast, the ensemble can reduce the pinball score by 4.39% on average. The proposed ensemble also demonstrates superior performance over nine other benchmark ensembles.
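To make the CQRA objective concrete, the sketch below computes the pinball loss of a weighted combination of quantile forecasts and searches for nonnegative weights that sum to one. It is an illustration under stated assumptions, not the paper's method: a coarse grid search over the weight simplex stands in for the linear program, and the function names, the grid resolution `steps`, and the data are all hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def pinball_loss(y: float, q: float, tau: float) -> float:
    """Pinball (quantile) loss of forecast q for observation y at quantile level tau."""
    return tau * (y - q) if y >= q else (1.0 - tau) * (q - y)


def mean_pinball(weights: Sequence[float], forecasts: List[Sequence[float]],
                 actuals: Sequence[float], tau: float) -> float:
    """Average pinball loss of the weighted combination of member forecasts."""
    total = 0.0
    for row, y in zip(forecasts, actuals):
        combined = sum(w * f for w, f in zip(weights, row))
        total += pinball_loss(y, combined, tau)
    return total / len(actuals)


def best_simplex_weights(forecasts: List[Sequence[float]], actuals: Sequence[float],
                         tau: float, steps: int = 20) -> Tuple[Tuple[float, ...], float]:
    """Grid search over weights w >= 0 with sum(w) == 1 (a stand-in for the LP)."""
    k = len(forecasts[0])
    best_w, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=k - 1):
        if sum(combo) > steps:
            continue  # would force the last weight to be negative
        w = tuple(c / steps for c in combo) + ((steps - sum(combo)) / steps,)
        loss = mean_pinball(w, forecasts, actuals, tau)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Hypothetical 0.9-quantile forecasts from two models (one row per time step).
forecasts = [(105.0, 95.0), (110.0, 102.0), (98.0, 90.0), (120.0, 111.0)]
actuals = [100.0, 108.0, 95.0, 113.0]

weights, loss = best_simplex_weights(forecasts, actuals, tau=0.9)
print(weights, loss)
```

Because the grid includes the corners of the simplex, the combined forecast found this way is never worse (on the fitting data) than the best single member. An exact solution would replace the grid search with a linear program, as the abstract describes.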
This paper reviews the state of the art in wind speed/power forecasting and solar irradiance forecasting with ensemble methods. The ensemble forecasting methods are grouped into two main categories: competitive ensemble forecasting and cooperative ensemble forecasting. Competitive ensemble forecasting is further categorized based on data diversity and parameter diversity. Cooperative ensemble forecasting is divided according to pre-processing and post-processing. Typical articles are discussed according to each category and their characteristics are highlighted. We also conduct comparisons based on reported results and on our own simulations. Suggestions for future research include ensembles of different paradigms and inter-category ensemble methods, among others.
Significant challenges arise when Graph Neural Networks (GNNs) deal with imbalanced data, particularly in signed and weighted graph structures, which makes classification tasks less effective. Within the GNN context, researchers have found traditional solutions like resampling, reweighting, and synthetic sample generation to be inadequate. GATE-GNN, a novel architecture designed specifically for imbalanced datasets, overcomes these limitations. GATE-GNN integrates an ensemble of network modules that harness the spatial features of graph networks and effectively utilise embedding information from earlier layers. This approach not only bolsters generalisation by reducing volatility but also refines the optimisation algorithm, resulting in more accurate and stable classification outcomes. We rigorously tested the effectiveness of GATE-GNN on four widely recognised datasets: Cora, NELL, Citeseer, and PubMed. We performed a comparative analysis against established methods such as Graph Convolutional Networks (GCN), Graph Sample and Aggregate (GraphSAGE), Propagation Multilayer Perceptron (PMLP), Imbalanced Node Sampling GNN (INS-GNN), GNN-Curriculum Learning (GNN-CL) and Graph Attention Networks (GAT). Empirical results demonstrate that GATE-GNN significantly outperforms these existing models, achieving an average improvement in classification accuracy of approximately 5%–10% over the previous best results. Additionally, GATE-GNN presents a marked reduction in training time, which underscores its efficiency and suitability for practical applications in imbalanced graph data scenarios. The implementation of GATE-GNN is available at https://github.com/afofanah/GATE-GNN.
•We propose the GATE-GNN model for node classification on imbalanced datasets.
•We design dynamic node interaction within the GATE architecture, leveraging learnable weights.
•We propose GEWA as a self-attention mechanism to enhance GNFE and NETL feature representation.
•The proposed model shows robust performance across four imbalanced GNN datasets.
N6-methyladenosine (m6A) is a prevalent RNA methylation modification, which plays an important role in various biological processes. Accurate identification of m6A sites is fundamental to a deep understanding of the biological functions and mechanisms of the modification. However, the experimental methods for detecting m6A sites are usually time-consuming and expensive, so various computational methods have been developed to identify m6A sites in RNA. This paper proposes a novel cross-species computational method, StackRAM, which uses machine learning algorithms to identify the m6A sites in Saccharomyces cerevisiae (S. cerevisiae), Homo sapiens (H. sapiens), Arabidopsis thaliana (A. thaliana) and Mus musculus (M. musculus). First, the RNA sequence features are extracted through binary encoding, chemical property, nucleotide frequency, k-mer nucleotide frequency, pseudo dinucleotide composition, and position-specific trinucleotide propensity, and the initial feature dataset is obtained by feature fusion. Second, the Elastic Net is used for the first time to filter redundant and noisy information and retain important features for m6A site classification. Finally, the base classifiers' output probabilities and the optimal feature subset selected by the Elastic Net are combined, and the combined features are fed into the second-stage meta-classifier, an SVM. The result of the jackknife test on the S. cerevisiae training dataset indicates that the prediction performance of StackRAM is superior to the current state-of-the-art methods. The prediction accuracy of StackRAM on the independent test datasets H. sapiens, A. thaliana and M. musculus reaches 92.30%, 87.06% and 91.86%, respectively. Therefore, StackRAM shows potential for cross-species prediction and can be a useful method for identifying m6A sites.
•A novel cross-species method (StackRAM) is used to identify RNA N6-methyladenosine sites.
•Features of the RNA samples are extracted by combining sequence-based features and physicochemical property-based features.
•The Elastic Net method is employed for the first time to obtain the optimal feature subset.
•StackRAM can mine the essential abstract features that characterize RNA methylation sites through hierarchical learning.
•The proposed method increases the prediction performance on independent testing datasets compared with other methods.
The cloud computing environment requires an adequate and accurate traffic prediction tool to fulfill the needs of customers and support organizations effectively. In the absence of an effective tool for forecasting cloud computing traffic, many organizations might fail. Because network traffic flow is inconsistent, it is difficult to predict the network resources needed to serve all network clients at a given time in a cloud computing environment. There is still room for improving the predictive accuracy of models in cloud computing. The higher the accuracy of the traffic flow prediction, the better the allocation of resources. Therefore, this study proposes an ensemble method called SGLA (Stepwise Gaussian Linear Autoregressive) by combining linear regression, support vector machines, Gaussian process regression, and the autoregressive integrated moving average technique. Using the averaging strategy, SGLA performed better than all compared methods, with a minimum MAPE of 1.03%. SGLA also shows a clear advantage in handling resource allocation despite traffic fluctuations, with 91.7% traffic prediction accuracy. Overall experimental results indicate that this method performed better than single models in terms of prediction accuracy. The main contribution of this study is to propose a data analytics model for enhancing cloud computing resource management.
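The averaging strategy and the MAPE metric mentioned above can be sketched in a few lines. This is a simplified illustration, not the SGLA method: it assumes each member model has already produced a forecast series, it does not reproduce the stepwise procedure, and the function names and numbers are hypothetical.

```python
from typing import List, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def average_ensemble(member_forecasts: List[Sequence[float]]) -> List[float]:
    """Combine member forecasts by taking the plain mean at each time step."""
    return [sum(values) / len(values) for values in zip(*member_forecasts)]


# Hypothetical traffic volumes and forecasts from two member models.
actual = [100.0, 200.0]
members = [
    [90.0, 210.0],   # under-predicts first, over-predicts second
    [110.0, 190.0],  # errors in the opposite direction
]

combined = average_ensemble(members)
print(mape(actual, members[0]), mape(actual, combined))
```

In this toy case the two members err in opposite directions, so averaging cancels their errors; real member models will rarely be this complementary.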
•We propose an ensemble STLF method based on the ELM.
•A wavelet-based ensemble scheme is introduced to STLF.
•A parallel model of 24 ELMs is established for 24-h load prediction.
•Both 1-h and 24-h ahead load forecasting are evaluated.
•The proposed method outperforms other techniques on the public datasets.
This paper proposes a novel ensemble method for short-term load forecasting based on wavelet transform, extreme learning machine (ELM) and partial least squares regression. In order to improve forecasting performance, a wavelet-based ensemble strategy is introduced into the forecasting model. The individual forecasters are derived from different combinations of mother wavelet and number of decomposition levels. For each sub-component from the wavelet decomposition, a parallel model consisting of 24 ELMs is invoked to predict the hourly load of the next day. The individual forecasts are then combined to form the ensemble forecast using the partial least squares regression method. Numerical results show that the proposed method can significantly improve forecasting performance.
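The wavelet decomposition step can be illustrated with the Haar wavelet, the simplest mother wavelet. This is a minimal sketch under assumptions: the abstract does not say which mother wavelets are used, the function names and load values are hypothetical, and the ELM forecasters and the partial least squares combination are not shown.

```python
import math
from typing import List, Sequence, Tuple


def haar_decompose(signal: Sequence[float], levels: int) -> Tuple[List[float], List[List[float]]]:
    """Split a series into a coarse approximation plus one detail series per level."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    s = math.sqrt(2.0)
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / s for a, b in pairs])  # local differences
        approx = [(a + b) / s for a, b in pairs]         # local averages
    return approx, details


def haar_reconstruct(approx: Sequence[float], details: List[List[float]]) -> List[float]:
    """Invert haar_decompose by undoing each level, coarsest first."""
    s = math.sqrt(2.0)
    current = list(approx)
    for detail in reversed(details):
        restored = []
        for a, d in zip(current, detail):
            restored.extend([(a + d) / s, (a - d) / s])
        current = restored
    return current


# Hypothetical hourly load values (length divisible by 2**levels).
load = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_decompose(load, levels=2)
rebuilt = haar_reconstruct(approx, details)
print(approx, details)
```

Each sub-component (the approximation and every detail series) would be forecast separately, and varying the mother wavelet and the number of levels would yield the different individual forecasters that the ensemble combines.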
Breast cancer is the second deadliest disease amongst women worldwide. Breast histopathology image analysis is one of the most powerful ways of detecting tumour malignancies. Manual breast histopathology image analysis is, however, subjective, time-consuming and prone to human error. Computer-aided diagnosis (CAD) has become a popular and viable solution for medical image analysis due to recent advances in computing power and memory. However, the performance of CAD models needs to improve before they can be used in practice. Convolutional neural network (CNN) based models have achieved promising results for breast histopathological image classification. In this paper, instead of relying on a single CNN model, we propose a novel rank-based ensemble method that combines the outcomes of three transfer learning CNN models, namely GoogleNet, VGG11 and MobileNetV3_Small. The proposed ensemble model is designed using the Gamma function to solve a 2-class classification problem on breast histopathological images. In comparison to state-of-the-art approaches, our method produces better classification results, with 99.16%, 98.24%, 98.67%, and 96.16% for 40X, 100X, 200X, and 400X levels of magnification, respectively, on a publicly accessible standard dataset called BreakHis and 96.95% on another well-known dataset called ICIAR-2018.
•A rank-based ensemble of deep learning models is designed to detect breast cancer.
•GoogleNet, VGG11, and MobileNetV3_Small are used as base learners.
•Decision scores of CNN models are fused using the Gamma function.
•The proposed method is tested on two histopathology datasets: BreakHis & ICIAR-2018.