Intrusion detection systems play a significant role in providing security in wireless sensor networks. Existing intrusion detection systems focus only on detecting known types of attacks; they fail to recognise new types of attacks introduced by malicious users, leading to vulnerability and information loss in the network. To address this challenge, a new intrusion detection system is proposed that detects both known and unknown types of attacks using an intelligent decision tree classification algorithm. For this purpose, a novel feature selection algorithm, named the dynamic recursive feature selection algorithm, which selects an optimal number of features from the data set, is proposed. In addition, an intelligent fuzzy temporal decision tree algorithm is proposed by extending the decision tree algorithm, and it is integrated with convolutional neural networks to detect intruders effectively. Experimental analysis carried out using the KDD Cup data set and a network trace data set demonstrates the effectiveness of the proposed approach: the false positive rate, energy consumption, and delay are reduced, and network performance is improved through an increased packet delivery ratio.
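The general idea of recursively narrowing a feature set down to an optimal subset can be sketched with scikit-learn's recursive feature elimination, used here only as a stand-in analog for the paper's dynamic recursive feature selection algorithm (the dataset and estimator are illustrative placeholders):

```python
# Hedged sketch: recursive elimination of weak features with a tree-based
# estimator, as an analog of dynamic recursive feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for intrusion records: 20 features, 5 informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
# Drop the weakest feature one at a time until 5 remain.
selector = RFE(DecisionTreeClassifier(random_state=0),
               n_features_to_select=5, step=1).fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(len(selected))  # 5 features retained
```

The stopping criterion (a fixed target count here) is an assumption; the paper's algorithm chooses the number of features dynamically.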
Nontechnical losses, particularly due to electricity theft, have long been a major concern in the power system industry. Large-scale fraudulent consumption of electricity can widen the demand-supply gap considerably, hence the need for a scheme that can detect such thefts precisely in complex power networks. With this focus, this paper proposes a comprehensive top-down scheme based on a decision tree (DT) and a support vector machine (SVM). Unlike existing schemes, the proposed scheme can precisely detect and locate real-time electricity theft at every level of power transmission and distribution (T&D). It combines DT and SVM classifiers for rigorous analysis of gathered electricity consumption data; in other words, it can be viewed as a two-level data processing and analysis approach, since the data processed by the DT are fed as input to the SVM classifier. The obtained results indicate that the proposed scheme reduces false positives to a great extent and is practical enough to be implemented in real-time scenarios.
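The two-level idea, with the DT's output feeding the SVM, can be sketched as follows; the dataset, split, and hyper-parameters are illustrative stand-ins, not the paper's experimental setup:

```python
# Hedged sketch of a two-level DT -> SVM pipeline: the tree's predicted
# class is appended as an extra feature for the second-level SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for consumption records.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

# Level 1: decision tree.
dt = DecisionTreeClassifier(max_depth=4, random_state=1).fit(Xtr, ytr)
# Level 2: SVM trained on the original features plus the DT output.
Xtr2 = np.column_stack([Xtr, dt.predict(Xtr)])
Xte2 = np.column_stack([Xte, dt.predict(Xte)])
svm = SVC().fit(Xtr2, ytr)
print(round(svm.score(Xte2, yte), 3))
```

Stacking the first-level prediction as a feature is one common way to realise "data processed by DT fed as input to the SVM"; the paper's exact coupling may differ.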
•A novel boosted tree model for credit scoring is proposed.
•A hyper-parameter optimization technique is developed based on the TPE algorithm.
•The model is shown to outperform several baseline techniques.
•The model is validated on five datasets over five performance metrics.
•The feature importance scores and decision chart enhance model interpretation.
Credit scoring is an effective tool for banks to make profitable lending decisions. Ensemble methods, which according to their structure can be divided into parallel and sequential ensembles, have recently been developed in the credit scoring domain and have proven their superiority in discriminating borrowers accurately. However, among ensemble models, little consideration has been given to the following: (1) the hyper-parameter tuning of the base learner, despite being critical to well-performed ensemble models; (2) building sequential models (i.e., boosting), as most work has focused on developing the same or different algorithms in parallel; and (3) the comprehensibility of models. This paper proposes a sequential ensemble credit scoring model based on a variant of the gradient boosting machine, namely extreme gradient boosting (XGBoost). The model comprises three main steps. First, data pre-processing is employed to scale the data and handle missing values. Second, a model-based feature selection system based on relative feature importance scores is used to remove redundant variables. Third, the hyper-parameters of XGBoost are adaptively tuned with Bayesian hyper-parameter optimization and used to train the model on the selected feature subset. Several hyper-parameter optimization methods and baseline classifiers are considered as reference points in the experiment. Results demonstrate that Bayesian hyper-parameter optimization performs better than random search, grid search, and manual search. Moreover, the proposed model outperforms baseline models on average over five evaluation measures: accuracy, error rate, the area under the curve (AUC), H measure, and Brier score. The proposed model also provides feature importance scores and a decision chart, which enhance the interpretability of the credit scoring model.
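The second step, model-based feature selection by relative importance scores, can be sketched as follows, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and a mean-importance threshold as an assumed cutoff:

```python
# Hedged sketch of importance-based feature selection with a boosted
# tree model (GradientBoostingClassifier standing in for XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a credit dataset: 15 features, 4 informative.
X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
imp = gbm.feature_importances_
# Treat features below the mean importance as redundant and drop them.
keep = [i for i, w in enumerate(imp) if w >= imp.mean()]
X_sel = X[:, keep]  # reduced feature subset for the final model
print(X_sel.shape[1] < X.shape[1])  # some features removed
```

The threshold choice (mean importance) is an assumption for illustration; the paper derives its cutoff from the relative importance scores of its own model.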
Water resources are the foundation of people's life and economic development and are closely related to health and the environment. Accurate prediction of water quality is the key to improving water management and pollution control. In this paper, two novel hybrid decision tree-based machine learning models are proposed to obtain more accurate short-term water quality predictions. The base models of the two hybrids are extreme gradient boosting (XGBoost) and random forest (RF), each combined with an advanced data denoising technique: complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN). Taking the water resources of the Gales Creek site in the Tualatin River Basin (one of the most polluted rivers in the world) as an example, a total of 1875 hourly data points from May 1, 2019 to July 20, 2019 are collected. The two hybrid models are used to predict six water quality indicators: water temperature, dissolved oxygen, pH value, specific conductance, turbidity, and fluorescent dissolved organic matter. Six error metrics are introduced as the basis of performance evaluation, and the results of the two models are compared with four conventional models. The results reveal that: (1) CEEMDAN-RF performs best in predicting temperature, dissolved oxygen, and specific conductance, with mean absolute percentage errors (MAPEs) of 0.69%, 1.05%, and 0.90%, respectively, while CEEMDAN-XGBoost performs best in predicting pH value, turbidity, and fluorescent dissolved organic matter, with MAPEs of 0.27%, 14.94%, and 1.59%, respectively. (2) The average MAPEs of the CEEMDAN-RF and CEEMDAN-XGBoost models are the smallest, at 3.90% and 3.71% respectively, indicating that their overall prediction performance is the best. In addition, the stability of the prediction models is also discussed in this paper.
The analysis shows that the prediction stability of CEEMDAN-RF and CEEMDAN-XGBoost is higher than that of the other benchmark models.
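The MAPE figures quoted above follow the standard definition; a minimal pure-Python version, with illustrative values rather than the paper's data:

```python
# Mean absolute percentage error (MAPE), in percent.
def mape(actual, predicted):
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)

# Illustrative hourly water-temperature readings (degC), not real data.
obs  = [20.1, 19.8, 20.5, 21.0]
pred = [20.0, 20.0, 20.4, 20.8]
print(round(mape(obs, pred), 2))  # → 0.74
```

MAPE is scale-free, which is why it lets the paper compare errors across indicators as different as pH and turbidity, though it inflates when observed values approach zero.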
•Two hybrid decision tree-based models are proposed to predict the water quality.
•An advanced denoising method is used to preprocess raw data.
•The case study was conducted on the Tualatin River in Oregon, USA, one of the most polluted rivers.
•The prediction stability of the model is analyzed.
•Use the median as the criterion for the feature weight threshold.
•Use the improved ReliefF as a feature pre-filtering method.
•Take the feature weight as the criterion for evaluating features in the decision tree.
To improve classification accuracy, a preprocessing step is used to pre-filter redundant or irrelevant features before decision tree construction, and a new feature selection algorithm, FWDT, is proposed on this basis. Experimental results show that the proposed FWDT method performs better on the measures of accuracy, recall, and F1-score. Furthermore, it reduces the time required to construct the decision tree.
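The pre-filtering idea, with the median feature weight as the threshold, can be illustrated with a simplified, pure-Python Relief-style weighting (binary Relief rather than the paper's improved ReliefF; data are toy values):

```python
# Simplified Relief weighting: features that separate a sample from its
# nearest miss but not from its nearest hit get high weights; features
# below the median weight are discarded before tree construction.
import statistics

def relief_weights(X, y):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i, (xi, yi) in enumerate(zip(X, y)):
        dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
        # Nearest hit (same class, excluding the sample itself).
        hit = min((x for j, (x, c) in enumerate(zip(X, y))
                   if c == yi and j != i), key=lambda x: dist(x, xi))
        # Nearest miss (different class).
        miss = min((x for x, c in zip(X, y) if c != yi),
                   key=lambda x: dist(x, xi))
        for f in range(d):
            w[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return [v / n for v in w]

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0, 5], [0, 9], [1, 4], [1, 8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
thr = statistics.median(w)
kept = [f for f in range(len(w)) if w[f] >= thr]
print(kept)  # → [0]
```

The full ReliefF additionally averages over k neighbors per class and handles multiclass data; this sketch only conveys why the median of the weights is a natural pre-filtering cutoff.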
Decision support systems help physicians and play an important role in medical decision-making. They are based on different models, and the best of them provide an explanation together with an accurate, reliable, and quick response. This paper presents a decision support tool for the detection of breast cancer based on three types of decision tree classifiers: single decision tree (SDT), boosted decision tree (BDT), and decision tree forest (DTF). Decision tree classification provides a rapid and effective method of categorizing data sets. Decision-making is performed in two stages: training the classifiers with features from the Wisconsin breast cancer data set, and then testing. The performance of the proposed structure is evaluated in terms of accuracy, sensitivity, specificity, confusion matrix, and receiver operating characteristic (ROC) curves. The results showed that the overall accuracies of SDT and BDT in the training phase reached 97.07 % with 429 correct classifications and 98.83 % with 437 correct classifications, respectively. BDT performed better than SDT on all performance indices. The ROC value and Matthews correlation coefficient (MCC) for BDT in the training phase reached 0.99971 and 0.9746, respectively, superior to the SDT classifier. During the validation phase, DTF achieved 97.51 %, superior to the SDT (95.75 %) and BDT (97.07 %) classifiers, with ROC and MCC values of 0.99382 and 0.9462, respectively. BDT showed the best performance in terms of sensitivity, while SDT was best only in terms of speed.
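The single-tree versus tree-ensemble comparison can be reproduced in outline on the scikit-learn copy of the Wisconsin breast cancer data; this is a hedged illustration with a random forest standing in for the decision tree forest, not the authors' exact toolchain or splits:

```python
# Hedged sketch: a single decision tree vs. a tree ensemble on the
# Wisconsin breast cancer data shipped with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(Xtr, ytr)
# Held-out accuracy of each classifier.
print(round(single.score(Xte, yte), 3), round(forest.score(Xte, yte), 3))
```

On typical splits the ensemble edges out the single tree, mirroring the paper's finding that DTF and BDT outperform SDT at some cost in speed.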
The k nearest neighbor (kNN) method is a popular classification method in data mining and statistics because of its simple implementation and significant classification performance. However, it is impractical for traditional kNN methods to assign a fixed k value (even one set by experts) to all test samples. Previous solutions assign different k values to different test samples via cross validation, but this is usually time-consuming. This paper proposes a kTree method to learn different optimal k values for different test/new samples by introducing a training stage into kNN classification. Specifically, in the training stage, the kTree method first learns optimal k values for all training samples via a new sparse reconstruction model, and then constructs a decision tree (namely, the kTree) using the training samples and the learned optimal k values. In the test stage, the kTree quickly outputs the optimal k value for each test sample, and kNN classification is then conducted using the learned optimal k value and all training samples. As a result, the proposed kTree method has a similar running cost but higher classification accuracy compared with traditional kNN methods, which assign a fixed k value to all test samples. Moreover, the proposed kTree method has a lower running cost but similar classification accuracy compared with newer kNN methods, which assign different k values to different test samples. This paper further proposes an improved version of the kTree method (namely, the k*Tree method) to speed up its test stage by additionally storing information about the training samples in the leaf nodes of the kTree, such as the training samples located in each leaf node, their kNNs, and the nearest neighbors of these kNNs. The resulting decision tree, called the k*Tree, enables kNN classification to be conducted using a subset of the training samples in the leaf nodes rather than all training samples as in the newer kNN methods, which further reduces the running cost of the test stage. Finally, experimental results on 20 real data sets showed that the proposed methods (i.e., kTree and k*Tree) are much more efficient than the compared methods on classification tasks.
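The core idea, kNN classification where each test sample uses its own k, can be sketched in pure Python; here the per-sample k values are supplied directly as placeholders for what the learned kTree would output:

```python
# Pure-Python sketch of per-sample-k kNN classification. In the paper,
# each test sample's k comes from the learned kTree; here it is given.
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training samples of x."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(X_train[i], x)))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two clusters.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [1.1, 1.1]]
y_train = [0, 0, 1, 1, 1]
# Each test sample paired with its own (here hand-picked) optimal k.
tests = [([0.2, 0.1], 1), ([1.0, 1.0], 3)]
preds = [knn_predict(X_train, y_train, x, k) for x, k in tests]
print(preds)  # → [0, 1]
```

The k*Tree refinement would further restrict the candidate neighbors to the training samples cached in the reached leaf, instead of scanning all of X_train as above.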
Data stream mining has recently grown in popularity, thanks to an increasing number of applications that need continuous and fast analysis of streaming data. Such data are generally produced in application domains that require immediate reactions under strict temporal constraints. These characteristics make classical machine learning algorithms problematic for mining knowledge from fast data streams and call for appropriate techniques. In this paper, building on the well-known Hoeffding Decision Tree (HDT) for streaming data classification, we introduce FHDT, a fuzzy HDT that extends HDT with fuzziness, making it more robust to noisy and vague data. We tested FHDT on three synthetic datasets usually adopted for analyzing concept drift in data stream classification, and on two real-world datasets already exploited in recent research on fuzzy systems for streaming data. We show that FHDT outperforms HDT, especially in the presence of concept drift. Furthermore, FHDT offers a high level of interpretability, thanks to the linguistic rules that can be extracted from it.
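The split decisions in a Hoeffding tree rest on the Hoeffding bound: after n observations of a statistic with range R, the true mean lies within epsilon of the observed mean with probability 1 - delta. A minimal version of that bound (the parameter values below are illustrative):

```python
# Hoeffding bound used by HDT-style learners to decide when enough
# stream examples have been seen to commit to a split.
import math

def hoeffding_bound(R, delta, n):
    """Epsilon such that |true mean - observed mean| <= epsilon
    with probability at least 1 - delta, after n observations."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

# For information gain with two classes, the range R is log2(2) = 1.
eps = hoeffding_bound(R=1.0, delta=1e-7, n=1000)
print(round(eps, 4))  # → 0.0898
# If the best split's gain exceeds the runner-up's by more than eps,
# the leaf can be split with confidence 1 - delta.
```

FHDT keeps this incremental split machinery but evaluates candidate splits over fuzzy partitions of the attributes, which is what yields the extractable linguistic rules.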