Medical data classification is considered a challenging task in the field of medical informatics. Although many works have been reported in the literature, there is still scope for improvement. In this paper, a feature-ranking-based approach is developed and implemented for medical data classification. The features of a dataset are ranked using suitable ranker algorithms, and the Random Forest classifier is then applied only to the highly ranked features to construct the predictor. We have conducted extensive experiments on 10 benchmark datasets, and the results are promising. We present highly accurate predictors for 10 different diseases, and suggest a methodology that is sufficiently general and is expected to perform well for other diseases with similar datasets.
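The ranking-then-classification pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data; the choice of ranker (mutual information), the 10-feature cut-off, and all parameter values are assumptions rather than the authors' setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a medical dataset.
X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)

# Rank features with a filter ranker (mutual information is one possible choice).
scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(scores)[::-1][:10]   # keep the 10 highest-ranked features

# Train Random Forest on the highly ranked features only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X[:, top_k], y, cv=5).mean()
print(round(acc, 3))
```

Any filter ranker (e.g., chi-squared or ReliefF) could replace mutual information in this sketch without changing the overall structure.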
In the last decade, machine learning (ML) techniques have been widely applied to identify different diseases, facilitating early diagnosis and increasing the chance of survival. The majority of medical datasets are unbalanced, and as a result ML classification techniques produce predictions biased toward the majority class. In this paper, a novel fitness function in Genetic Programming (GP) for medical data classification is proposed that handles the problem of unbalanced data. Four benchmark medical datasets, namely chronic kidney disease (CKD), fertility, BUPA liver disorder, and Wisconsin diagnostic breast cancer (WDBC), have been taken from the University of California, Irvine (UCI) machine learning repository and classified using the proposed technique. The proposed technique achieved best accuracies of 100%, 99.12%, 85.0%, and 75.36% on the CKD, WDBC, fertility, and BUPA datasets respectively, and best AUCs of 1.0, 0.99, 0.92, and 0.75 respectively. These results show an improvement over other GP and SVM methods, confirming the efficiency of the proposed algorithm.
•A novel fitness function in Genetic Programming for medical data classification has been proposed.•Four benchmark medical datasets, taken from the UCI repository, are classified using the proposed technique.•The performance of the proposed technique has been compared with the Support Vector Machine (SVM) and other state-of-the-art works available in the literature.•The results show that the proposed technique gives performance better than or comparable to the SVM and other state-of-the-art works.
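The paper's exact fitness function is not reproduced here; the sketch below only illustrates the general idea of replacing raw accuracy with a class-balanced fitness so that a GP individual cannot score well by simply favouring the majority class. The specific formula (mean of sensitivity and specificity) is an assumption for illustration.

```python
def balanced_fitness(y_true, y_pred):
    """Mean of sensitivity and specificity for a binary problem (labels 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return 0.5 * (sensitivity + specificity)

# An individual that always predicts the majority class scores only 0.5 here,
# even though its raw accuracy on a 9:1 dataset would be 0.9.
y_true = [0] * 9 + [1]
print(balanced_fitness(y_true, [0] * 10))   # 0.5
```

Such a fitness would be plugged into the GP evaluation step in place of accuracy when evolving classifier trees.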
Flow chart of RFMSE.
•We propose a misclassification-oriented synthetic minority over-sampling technique, which overcomes the blindness of the original synthetic minority over-sampling technique in synthesizing samples.•Our sampling algorithm integrates the advantages of data resampling algorithms and the random forest algorithm.•Our sampling algorithm uses the misclassification-oriented synthetic minority over-sampling technique and the edited nearest neighbor under-sampling technique to sample imbalanced data.•Extensive experiments show that our sampling algorithm performs better than other data resampling algorithms in medical diagnosis.
The problem of imbalanced data classification often exists in medical diagnosis. Traditional classification algorithms usually assume that the number of samples in each class is similar and that their misclassification costs during training are equal. However, the misclassification cost of patient samples is higher than that of healthy-person samples. Therefore, how to increase the identification of patients without affecting the classification of healthy individuals is an urgent problem. To solve the problem of imbalanced data classification in medical diagnosis, we propose a hybrid sampling algorithm called RFMSE, which combines the Misclassification-oriented Synthetic Minority Over-sampling Technique (M-SMOTE) and Edited Nearest Neighbor (ENN) based on Random Forest (RF). The algorithm is composed of three main parts. First, M-SMOTE is used to increase the number of samples in the minority class, with the over-sampling rate of M-SMOTE set to the misclassification rate of RF. Then, ENN is used to remove noisy samples from the majority class. Finally, RF performs classification prediction on the samples after hybrid sampling, and the stopping criterion for the iterations is determined by changes in a classification index, the Matthews Correlation Coefficient (MCC): when the MCC drops continuously, the iterations stop. Extensive experiments conducted on ten UCI datasets demonstrate that RFMSE can effectively solve the problem of imbalanced data classification. Compared with traditional algorithms, our method improves F-value and MCC more effectively.
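A single round of the RFMSE idea might look like the following sketch: the over-sampling amount is tied to the RF misclassification rate of the minority class, and MCC is the monitored index. The iterative stopping loop and the ENN cleaning step are omitted for brevity, and all function names, data, and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=5):
    """Interpolate between minority samples and their minority neighbours."""
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i][1:])      # a random minority neighbour (not self)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

# Imbalanced, overlapping toy data: 180 majority vs 20 minority samples.
X = np.vstack([rng.normal(0, 1, (180, 4)), rng.normal(1, 1, (20, 4))])
y = np.array([0] * 180 + [1] * 20)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(rf, X, y, cv=5)
mis_rate = float(np.mean(pred[y == 1] != 1))   # minority misclassification rate
n_new = max(1, int(mis_rate * 160))            # over-sampling amount tied to mis_rate
X_syn = smote_like(X[y == 1], n_new)

X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])

# Monitor MCC on the original samples after resampling (full RFMSE would
# iterate this and stop once MCC drops continuously).
mcc = matthews_corrcoef(y, cross_val_predict(rf, X_bal, y_bal, cv=5)[:len(y)])
print(round(mcc, 3))
```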
The Broad Learning System (BLS) is widely used in many fields because of its strong feature extraction ability and high computational efficiency. However, the BLS is mainly used in supervised learning, which greatly limits its applicability: in practice, labeled data are scarce while unlabeled data are abundant. Therefore, the BLS is extended within the manifold regularization framework of semi-supervised learning, yielding a semi-supervised broad learning system (SS-BLS). First, features are extracted from labeled and unlabeled data by building feature nodes and enhancement nodes. Then the manifold regularization framework is used to construct the Laplacian matrix. Next, the feature nodes, enhancement nodes, and Laplacian matrix are combined to construct the objective function, which is solved efficiently by ridge regression to obtain the output coefficients. Finally, the validity of SS-BLS is verified on three complex datasets: G50C, MNIST, and NORB. The experimental results show that SS-BLS achieves high classification accuracy on these datasets, with fast operation and strong generalization ability.
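The SS-BLS objective reduces to a Laplacian-regularized ridge regression over the concatenated node outputs. The toy sketch below illustrates that structure with a single random nonlinear mapping standing in for the feature and enhancement nodes; node sizes, regularization constants, and data are assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)

# Toy semi-supervised data: two Gaussian blobs, only 10 of 200 points labeled.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
labeled = np.r_[0:5, 100:105]          # indices with known labels

# One random nonlinear mapping stands in for feature + enhancement nodes.
W = rng.normal(size=(2, 50)); b = rng.normal(size=50)
H = np.tanh(X @ W + b)

# Graph Laplacian L = D - A from a symmetrized kNN graph over ALL points.
A = kneighbors_graph(X, n_neighbors=8, include_self=False).toarray()
A = np.maximum(A, A.T)
L = np.diag(A.sum(1)) - A

# One-hot targets for the labeled rows only.
Y = np.zeros((len(labeled), 2)); Y[np.arange(len(labeled)), y[labeled]] = 1
Hl = H[labeled]

# Ridge solution of the manifold-regularized objective:
# min ||Hl b - Y||^2 + lam ||b||^2 + mu tr(b' H' L H b)
lam, mu = 1e-2, 1e-3
beta = np.linalg.solve(Hl.T @ Hl + lam * np.eye(50) + mu * H.T @ L @ H, Hl.T @ Y)
acc = np.mean(np.argmax(H @ beta, 1) == y)
print(round(acc, 3))
```

The Laplacian term is what lets the 190 unlabeled points shape the decision boundary.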
•We claim that SMOTE has a weakness when facing high-dimensional problems.•We propose a general version of the SMOTE strategy using OWA operators.•The proposal includes a feature weighting process that considers relevancy/redundancy.•This new component leads to a better definition of the neighborhood of minority samples.•Experiments carried out on 42 datasets show the virtues of our method.
The Synthetic Minority Over-sampling Technique (SMOTE) is a well-known resampling strategy that has been successfully used for dealing with the class-imbalance problem, one of the most challenging pattern recognition tasks in the last two decades. In this work, we claim that SMOTE has an important issue when defining the neighborhood in order to create new minority samples: the use of the Euclidean distance may not be suitable in high-dimensional settings. Our hypothesis is that the use of a weighted metric that does not assume that all features are equally important could improve performance in the presence of noisy/redundant variables. In this line, we present a novel SMOTE-like method that uses the weighted Minkowski distance for defining the neighborhood for each example of the minority class. This methodology leads to a better definition of the neighborhood since it prioritizes those features that are more relevant for the classification task. A complementary advantage of the proposal is performing feature selection since attributes can be discarded when their corresponding weights are below a given threshold. Our experiments on 42 class-imbalance datasets show the virtues of the proposed SMOTE variant, achieving the best predictive performance when compared with the traditional SMOTE approach and other recent variants on low- and high-dimensional settings, handling issues such as class overlap and hubness adequately without increasing the complexity of the method.
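The weighted-neighborhood idea can be sketched as follows: feature weights from a relevance measure define a weighted Minkowski distance, under which each minority sample finds its neighbours before interpolation. Mutual information stands in for the paper's weighting scheme, and the data and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

def weighted_minkowski(a, b, w, p=2):
    """Minkowski distance in which each feature is scaled by its weight."""
    return np.sum(w * np.abs(a - b) ** p) ** (1.0 / p)

# Toy imbalanced data: feature 0 is informative, features 1-9 are noise.
X = np.vstack([np.c_[rng.normal(0, 1, 95), rng.normal(0, 1, (95, 9))],
               np.c_[rng.normal(3, 1, 5), rng.normal(0, 1, (5, 9))]])
y = np.array([0] * 95 + [1] * 5)

# Feature weights from a relevance measure; weights below a threshold could be
# zeroed, which simultaneously performs feature selection.
w = mutual_info_classif(X, y, random_state=0)
w = w / w.sum() if w.sum() > 0 else np.full(X.shape[1], 1 / X.shape[1])

X_min = X[y == 1]
new = []
for i in range(len(X_min)):
    # Neighbourhood of minority sample i under the weighted metric.
    d = [weighted_minkowski(X_min[i], X_min[j], w) for j in range(len(X_min))]
    j = int(np.argsort(d)[1])      # nearest other minority sample (index 0 is self)
    new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
print(len(new), 'synthetic minority samples')
```

Because the noise features receive near-zero weight, the neighbourhood is driven by the informative feature, which is the behaviour the paper argues plain Euclidean SMOTE lacks in high dimensions.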
•The proposed approach applies a self-adaptive cost-sensitive SVM as the basic weak learner.•The method modifies the standard AdaBoost scheme to a cost-sensitive one suitable for SVM.•The method ensures the consistency of the optimization objectives of AdaBoost and SVM.•The method self-adaptively considers the different contributions of minority instances.•The cost update strategy can slightly skew the final boundary away from the minority class.
Imbalanced data classification poses a major challenge in the data mining community. Although the standard support vector machine can generally show relatively robust performance on classification problems with imbalanced data sets, it is a typical overall accuracy-oriented algorithm, which results in the final decision boundary biasing toward the majority class. Ensemble methods have emerged as meta-techniques for improving the generalization performance of existing learning algorithms. In this paper, we propose a novel self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification. In the proposed approach, to guarantee the consistency of the optimization objectives between the weak learners and the boosting scheme, we not only apply cost-sensitive SVMs as the basic weak learners but also modify the standard boosting scheme to a cost-sensitive one. To provide more training minority instances for successive classifiers, especially borderline minority instances, we also present a self-adaptive sequential misclassification cost weights determination method. The method self-adaptively considers the different contributions of minority instances to the formation of the SVM classifier at each boosting iteration, based on the preceding classifier, which allows it to produce diverse classifiers and thus improves its generalization performance. In the experiments, we analyze and discuss the effect of different parameters on performance, and suggestions are also provided. Extensive experimental results on different imbalanced datasets demonstrate that the proposed approach achieves better generalization performance in terms of G-Mean and F-Measure than other existing imbalanced data classification techniques.
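A stripped-down version of this cost-sensitive SVM boosting scheme might look like the sketch below: `class_weight` plays the role of the misclassification cost and `sample_weight` carries the boosting weights. The fixed class costs and the plain AdaBoost update are simplifications of the paper's self-adaptive cost determination.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Imbalanced toy data with labels in {-1, +1}.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.array([-1] * 90 + [1] * 10)

# AdaBoost-style loop over cost-sensitive SVMs.
w = np.full(len(y), 1 / len(y))
learners, alphas = [], []
for _ in range(5):
    # Higher cost on the minority class counteracts the boundary bias.
    clf = SVC(kernel='rbf', class_weight={-1: 1, 1: 9}).fit(X, y, sample_weight=w)
    pred = clf.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()
    learners.append(clf); alphas.append(alpha)

def ensemble_predict(X):
    return np.sign(sum(a * c.predict(X) for a, c in zip(alphas, learners)))

acc = np.mean(ensemble_predict(X) == y)
print(round(acc, 3))
```

In the paper the minority costs are updated self-adaptively per iteration rather than held fixed as here.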
•Novel embedded feature selection approach for SVM for imbalanced data sets.•Optimization is performed via Quasi-Newton updates and Armijo line search.•Best classification performance is achieved in experiments on benchmark datasets.
In this work, we propose a novel feature selection approach designed to deal with two major issues in machine learning, namely class imbalance and high dimensionality. The proposed embedded strategy penalizes the cardinality of the feature set via the scaling factors technique, and is used with two support vector machine (SVM) formulations designed to deal with the class-imbalance problem, namely Cost-Sensitive SVM and Support Vector Data Description. The proposed concave formulations are solved via a Quasi-Newton update and Armijo line search. We performed experiments on 12 highly imbalanced microarray datasets using linear and Gaussian kernels, achieving the highest average predictive performance with our approach compared with the most well-known feature selection strategies.
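The concave scaling-factor formulation itself is not reproduced here. As a loosely related embedded alternative, the sketch below uses an L1-penalized cost-sensitive linear SVM, which likewise drives feature weights to zero (performing selection inside training) while handling imbalance through class costs; this is a stand-in, not the authors' method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# High-dimensional, imbalanced toy data (a microarray-like stand-in).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# L1 penalty plays the role of the cardinality penalty; class_weight plays
# the role of the misclassification costs.
clf = LinearSVC(penalty='l1', dual=False, C=0.1,
                class_weight='balanced', max_iter=5000).fit(X, y)
selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-8)
print(len(selected), 'of', X.shape[1], 'features kept')
```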
Wind power curves play important roles in wind power forecasting, wind turbine condition monitoring, estimation of wind energy potential and wind turbine selection. In practice, it is a challenging task to produce reliable wind power curves from raw wind data due to the presence of outliers formed in unexpected conditions, e.g., wind curtailment and blade damage. This paper comprehensively reviews wind power curve modeling techniques from the perspective of modeling processes, i.e., wind data analyses, wind data preprocessing and various wind power curve models. Moreover, the performances of many popular power curve models are studied in different seasons and different wind farms. The results show that no universal wind power curve model can always perform better than other models under any environmental conditions. In general, there are three factors that affect the final wind power curves: data filtering approaches; wind power curve models; and choice of optimization strategies (especially the method applied to construct objective functions). However, there is no guarantee that all outliers will be removed from the raw wind data. Consequently, designing robust regression models or constructing robust objective functions may be two effective ways to obtain accurate power curves in the presence of outliers. The above two strategies depend largely on the error characteristics of power curve modeling. While it is often observed that the error distribution of the power curve modeling may be asymmetric, few researchers have considered this trait when building wind power curves. Therefore, this paper proposes several strategies that focus on designing asymmetric loss functions and developing robust regression models with asymmetric error distributions. Models that benefit from these characteristics may be more suitable for power curve modeling tasks and are more likely to produce better wind power curves.
•The different roles of wind power curve models in the utilization of wind energy are reviewed.•The paper discusses the classification of preprocessing approaches for the uncertainties in raw wind data in detail.•The paper provides a comprehensive review of the probabilistic and deterministic wind power curve models.•The performances of many power curve models are comprehensively compared under different environmental conditions.•The asymmetric error characteristic of power curve modeling is analyzed and considered for accuracy improvement.
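One of the proposed strategies, fitting a power curve under an asymmetric loss so that downward outliers (e.g., curtailment) pull the fit less, can be sketched as follows. The logistic curve shape, the pinball-style squared loss, and all data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic wind data: logistic-shaped power curve plus downward outliers.
v = rng.uniform(0, 20, 300)                        # wind speed (m/s)
true_p = 1.0 / (1.0 + np.exp(-(v - 10.0)))         # normalised power
p = true_p + np.where(rng.random(300) < 0.8,
                      rng.normal(0, 0.02, 300),            # usual noise
                      -np.abs(rng.normal(0, 0.3, 300)))    # curtailment-like dips

def curve(theta, v):
    a, b = theta
    return 1.0 / (1.0 + np.exp(-a * (v - b)))

def asymmetric_loss(theta, tau=0.8):
    r = p - curve(theta, v)
    # Pinball-style weighting: points above the curve (r > 0) cost more than
    # points below it, pushing the fit toward the outlier-free upper envelope.
    return np.mean(np.where(r > 0, tau, 1 - tau) * r ** 2)

theta_hat = minimize(asymmetric_loss, x0=[0.5, 8.0], method='Nelder-Mead').x
print(np.round(theta_hat, 2))
```

Setting `tau = 0.5` recovers ordinary least squares, which the downward outliers would drag below the true curve.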
With randomly generated weights between the input and hidden layers, a random vector functional link (RVFL) network is a universal approximator for continuous functions on compact sets, with a fast learning property. Though it was proposed two decades ago, the classification ability of this family of networks has not been fully investigated. Through a very comprehensive evaluation using 121 UCI datasets, this work investigates the effect of the bias in the output layer, direct links from the input layer to the output layer, the type of activation function in the hidden layer, the scaling of parameter randomization, and the solution procedure for the output weights. Surprisingly, we found that the direct links play an important performance-enhancing role in RVFL, while the bias term in the output neuron has no significant effect. The ridge regression based closed-form solution was better than the Moore–Penrose pseudoinverse-based one. Instead of using a uniform randomization in [−1, +1] for all datasets, tuning the scaling of the uniform randomization range for each dataset enhances the overall performance. Six commonly used activation functions were investigated in this work, and we found that the hardlim and sign activation functions degrade the overall performance. These basic conclusions can serve as general guidelines for designing RVFL network based classifiers.
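A minimal RVFL classifier incorporating two of the design choices found beneficial above, direct input-output links and the ridge regression closed-form solution, can be sketched as follows; the sigmoid activation, node count, scaling, and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvfl_train(X, Y, n_hidden=100, scale=1.0, ridge=1e-2):
    """RVFL with direct input-output links and a ridge closed-form solution."""
    W = rng.uniform(-scale, scale, (X.shape[1], n_hidden))
    b = rng.uniform(-scale, scale, n_hidden)
    # Direct links: raw inputs are concatenated with the hidden activations.
    H = np.hstack([X, 1.0 / (1.0 + np.exp(-(X @ W + b)))])
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ Y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    H = np.hstack([X, 1.0 / (1.0 + np.exp(-(X @ W + b)))])
    return np.argmax(H @ beta, axis=1)

# Two-blob toy problem with one-hot targets.
X = np.vstack([rng.normal(-1, 0.5, (100, 2)), rng.normal(1, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
Y = np.eye(2)[y]
W, b, beta = rvfl_train(X, Y)
acc = np.mean(rvfl_predict(X, W, b, beta) == y)
print(round(acc, 3))
```

Per the paper's findings, `scale` would be tuned per dataset rather than fixed at 1.0.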
This research introduces the Boosted Ensemble deep Multi-Layer Perceptron (EdMLP) architecture with multiple output layers, a novel enhancement of the traditional Multi-Layer Perceptron (MLP). By adopting a layer-wise training approach, EdMLP enables the integration of boosting techniques within a single model, treating each layer as a weak learner, resulting in substantial performance gains. Additionally, layer-wise hyperparameter tuning allows the optimization of individual layers, thereby reducing tuning time. Furthermore, the versatility of the ensemble deep architecture extends to other neural network based models, such as the Self Normalized Network (SNN), where experiments demonstrate substantial performance enhancements of EdSNN over the original SNN model. This research underscores the potential of EdMLP, and of the Ed architecture in general, as a powerful tool for improving the performance of various multilayer feedforward neural network models. The source code of this work is publicly accessible from the authors' GitHub.
•Proposed EdMLP architecture with multiple output layers to enhance conventional MLP.•Layer-wise training integrates boosting in the second and subsequent output layers.•Layer-wise tuning cuts hyperparameter tuning time significantly.•Training a deep neural network with multiple output layers and boosting can improve other models.
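The layer-as-weak-learner idea can be caricatured with random (untrained) hidden layers, each topped by its own ridge-regression output head, combined AdaBoost-style. This is a loose structural sketch, not the EdMLP training procedure, and every detail below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two noisy blobs with labels in {-1, +1}.
X = np.vstack([rng.normal(-1, 1.2, (150, 4)), rng.normal(1, 1.2, (150, 4))])
y = np.array([-1] * 150 + [1] * 150)

w = np.full(len(y), 1 / len(y))      # boosting weights over samples
rep = X                              # representation deepens layer by layer
layers, heads, alphas = [], [], []
for _ in range(3):
    # Next hidden layer stacked on the previous representation.
    W = rng.normal(size=(rep.shape[1], 30)); b = rng.normal(size=30)
    rep = np.tanh(rep @ W + b)
    # Per-layer output head = weighted ridge regression (the weak learner).
    Wd = np.diag(w)
    beta = np.linalg.solve(rep.T @ Wd @ rep + 1e-2 * np.eye(30), rep.T @ Wd @ y)
    pred = np.sign(rep @ beta)
    err = min(max(np.sum(w[pred != y]), 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    w = w * np.exp(-alpha * y * pred); w /= w.sum()
    layers.append((W, b)); heads.append(beta); alphas.append(alpha)

# Ensemble prediction sums the boosted output layers.
rep = X; score = np.zeros(len(y))
for (W, b), beta, a in zip(layers, heads, alphas):
    rep = np.tanh(rep @ W + b)
    score += a * np.sign(rep @ beta)
acc = np.mean(np.sign(score) == y)
print(round(acc, 3))
```

In EdMLP proper, each layer and head would be trained by backpropagation and tuned layer-wise rather than drawn at random.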