Fuzzy clustering algorithms generally treat all feature components of data points as equally important. However, many datasets involve irrelevant features in the clustering process that can degrade the performance of fuzzy clustering algorithms. That is, different feature components should carry different degrees of importance. In this paper, we present a novel method for improving fuzzy clustering algorithms that automatically computes individual feature weights and simultaneously reduces irrelevant feature components. In fuzzy clustering, the fuzzy c-means (FCM) algorithm is the best known. We first consider the FCM objective function with feature-weighted entropy, construct a learning scheme for its parameters, and then reduce the irrelevant feature components. We call the result a feature-reduction FCM (FRFCM). During the FRFCM process, a new procedure eliminates irrelevant features with small weights, achieving feature reduction. The computational complexity of FRFCM is also analyzed. Numerical and real datasets are used to compare FRFCM with various feature-weighted FCM methods in the literature. Experimental results and comparisons demonstrate the effectiveness and practical usefulness of FRFCM.
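The core update in such an entropy-regularised, feature-weighted scheme can be sketched as follows. This is a minimal NumPy illustration, not the authors' exact FRFCM: the farthest-point initialisation, the heuristic entropy scale (taken as the mean per-feature dispersion when not supplied), and the fixed weight threshold for dropping features are all assumptions of the sketch.

```python
import numpy as np

def feature_weighted_fcm(X, c=2, m=2.0, gamma=None, n_iter=30, drop_thresh=0.1, seed=0):
    """Simplified feature-weighted FCM with an entropy term (not the exact FRFCM).

    Weights minimise sum_j w_j*D_j + gamma*sum_j w_j*log(w_j) subject to
    sum_j w_j = 1, giving w_j proportional to exp(-D_j/gamma); features whose
    weight falls below drop_thresh (a toy choice) are eliminated.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    keep = np.arange(p)                 # indices of surviving features
    w = np.full(p, 1.0 / p)             # uniform initial feature weights
    # farthest-point initialisation of the c cluster centres
    centres = [int(rng.integers(n))]
    for _ in range(c - 1):
        dmin = np.min([((X - X[i]) ** 2).sum(1) for i in centres], axis=0)
        centres.append(int(np.argmax(dmin)))
    V = X[centres].astype(float)
    for _ in range(n_iter):
        Xk = X[:, keep]
        # weighted squared distances to each centre: shape (n, c)
        d = np.stack([(w * (Xk - V[k]) ** 2).sum(1) for k in range(c)], axis=1)
        d = np.maximum(d, 1e-12)
        u = d ** (-1.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)            # fuzzy memberships
        um = u ** m
        V = (um.T @ Xk) / um.sum(axis=0)[:, None]    # centre update
        # per-feature fuzzy within-cluster dispersion D_j
        D = np.array([(um * (Xk[:, [j]] - V[:, j]) ** 2).sum()
                      for j in range(len(keep))])
        g = gamma if gamma is not None else D.mean() # heuristic entropy scale
        w_new = np.exp(-D / g)
        w_new /= w_new.sum()
        survivors = w_new >= drop_thresh             # drop small-weight features
        if survivors.any() and not survivors.all():
            keep, w_new, V = keep[survivors], w_new[survivors], V[:, survivors]
        w = w_new / w_new.sum()
    return u, V, keep, w

# toy check: two well-separated clusters in features 0-1, uniform noise in 2-3
rng = np.random.default_rng(42)
blob = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(10.0, 0.3, (50, 2))])
X_demo = np.hstack([blob, rng.uniform(0.0, 10.0, (100, 2))])
u, V, keep, w = feature_weighted_fcm(X_demo, c=2)
```

On this toy data the noisy features accumulate large dispersion, receive small weights, and are eliminated, leaving only the two informative features.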
•RF and SVM were applied and compared to predict pyrolytic gas yield and compositions.•Feature reduction was applied to improve performance (R2 > 0.85, RMSE < 5.7%).•The importance of features for different targets was identified.•Partial dependence analysis provided new insights into the pyrolysis process.
This study aimed to utilize machine learning algorithms combined with feature reduction to predict pyrolytic gas yield and compositions from pyrolysis conditions and biomass characteristics. To this end, random forest (RF) and support vector machine (SVM) models were introduced and compared. The results suggested that six features were adequate to accurately forecast (R2 > 0.85, RMSE < 5.7%) the yield, while the compositions required only three. Moreover, the profound information behind the models was extracted. The relative contribution of pyrolysis conditions was higher than that of biomass characteristics for yield (55%), CO2 (73%), and H2 (81%), whereas the reverse held for CO (12%) and CH4 (38%). Furthermore, partial dependence analysis quantified the effects that both the reduced features and their interactions exert on the pyrolysis process. This study provides references for pyrolytic gas production and upgrading in a more convenient manner with fewer features and extends knowledge of the biomass pyrolysis process.
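The "relative contribution" comparison above rests on feature-importance estimates. A model-agnostic way to obtain them is permutation importance: the increase in prediction error when one feature's column is shuffled. The sketch below uses a plain least-squares model on synthetic data; the grouping of the first two columns as "pyrolysis conditions" is purely illustrative, not the study's data or model.

```python
import numpy as np

def permutation_importance(model_predict, X, y, n_repeats=10, seed=0):
    """Permutation importance: mean increase in MSE when a feature is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((model_predict(X) - y) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break feature-target link
            imp[j] += np.mean((model_predict(Xp) - y) ** 2) - base
    return imp / n_repeats

# toy data: two "condition" features with large effect, two
# "biomass characteristic" features with small effect (names hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] + 0.3 * X[:, 3] \
    + 0.1 * rng.normal(size=500)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # simple stand-in model
imp = permutation_importance(lambda Z: Z @ beta, X, y)
rel = imp / imp.sum()
cond_share = rel[:2].sum()   # share attributed to the "condition" group
```

Summing normalised importances over a feature group gives the kind of relative-contribution percentage quoted in the abstract.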
Improved Random Forest for Classification
Paul, Angshuman; Mukherjee, Dipti Prasad; Das, Prasun; et al.
IEEE Transactions on Image Processing, August 2018, Volume 27, Issue 8
Journal Article
Peer-reviewed
We propose an improved random forest classifier that performs classification with a minimum number of trees. The proposed method iteratively removes some unimportant features. Based on the numbers of important and unimportant features, we formulate a novel theoretical upper limit on the number of trees to be added to the forest to ensure improvement in classification accuracy. Our algorithm converges with a reduced but important set of features. We prove that further addition of trees or further reduction of features does not improve classification performance. The efficacy of the proposed approach is demonstrated through experiments on benchmark data sets. We further use the proposed classifier to detect mitotic nuclei in histopathological data sets of breast tissues. We also apply our method to an industrial data set of dual-phase steel microstructures to classify different phases. Results of our method on different data sets show a significant reduction in average classification error compared with a number of competing methods.
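The iterative removal of unimportant features can be illustrated with a much simpler importance score. The sketch below uses absolute correlation with the label in place of forest-based importance and does not reproduce the paper's theoretical bound on tree count; the relative cutoff rule is an assumption of the sketch.

```python
import numpy as np

def iterative_feature_pruning(X, y, cutoff=0.2, max_rounds=10):
    """Toy analogue of iteratively removing unimportant features.

    A feature survives a round if its score is at least `cutoff` times the
    best score of that round; the loop stops once the feature set is stable.
    """
    keep = np.arange(X.shape[1])
    for _ in range(max_rounds):
        scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep])
        survivors = scores >= cutoff * scores.max()
        if survivors.all():          # converged: nothing left to remove
            break
        keep = keep[survivors]
    return keep

# toy check: one informative feature, two pure-noise features
rng = np.random.default_rng(3)
y_demo = rng.integers(0, 2, 1000)
X_demo = np.column_stack([y_demo + 0.3 * rng.normal(size=1000),
                          rng.normal(size=1000),
                          rng.normal(size=1000)])
kept = iterative_feature_pruning(X_demo, y_demo)
```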
•We compare the performance of random forest variable selection methods.•VSURF and Jiang's method are preferable for most datasets.•varSelRF and Boruta perform well for data with >50 predictors.•Methods with conditional random forests usually have similar performance.•Whether a method is test-based or performance-based is not likely to impact performance.
Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods, and test-based versus performance-based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
Hyperspectral remote sensing images (HSIs) are acquired to encompass the essential information of land objects through contiguous narrow spectral wavelength bands. Classification accuracy using the entire original HSI is often not satisfactory in a cost-effective way for practical applications. To enhance the classification of HSIs, band-reduction strategies are applied, which can be divided into feature extraction and feature selection methods. PCA (Principal Component Analysis), a linear unsupervised statistical transformation, is frequently adopted for the extraction of features from HSIs. In this paper, PCA and SPCA (Segmented-PCA), SSPCA (Spectrally Segmented-PCA), FPCA (Folded-PCA) and MNF (Minimum Noise Fraction) as linear variants of PCA, together with KPCA (Kernel-PCA) and KECA (Kernel Entropy Component Analysis) as nonlinear variants of PCA, have been investigated. The top transformed features were picked out using accumulation of variance for all feature extraction methods except MNF and KECA; MNF uses SNR (Signal-to-Noise Ratio) values and KECA employs Renyi quadratic entropy for this purpose. The studied approaches are compared and analyzed for classification of the Indian Pines agricultural and urban Washington DC Mall HSIs using an SVM (Support Vector Machine) classifier. The experiments illustrate the cost-effective and improved classification performance of the feature extraction approaches over using the entire original dataset. MNF offers the highest classification accuracy, and FPCA offers the least space and time complexity with satisfactory classification results.
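The accumulation-of-variance rule used to pick the top transformed features can be sketched with plain PCA. The toy "image" below is a random pixel-by-band matrix driven by three latent spectral sources (an assumption of the sketch); MNF's SNR-based ordering and KECA's entropy criterion are not shown.

```python
import numpy as np

def pca_reduce(X, var_keep=0.99):
    """Reduce spectral bands with PCA, keeping the fewest leading components
    whose cumulative variance ratio reaches `var_keep`."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(vals)[::-1]            # sort descending by variance
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, var_keep)) + 1
    return Xc @ vecs[:, :k], k

# toy "HSI" matrix: 200 pixels x 50 bands, only 3 latent spectral sources
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))
Z, k = pca_reduce(X)
```

Because the 50 bands are driven by only 3 sources, the cumulative-variance rule keeps 3 components and discards the rest.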
•Utilised the NSL-KDD data set for the binary and multiclass problems with a 20% training dataset.•This paper studied a new model that can be used to estimate the intrusion scope threshold degree.•The experimental results revealed that the hybrid approach had a significant effect on minimising computational and time complexity.•The accuracy of the proposed model was satisfactory at 99.77% and 99.63% for the binary-class and multiclass NSL-KDD data sets, respectively.
Efficiently detecting network intrusions requires the gathering of sensitive information. This means that one has to collect large amounts of network transactions, including fine details of recent network transactions. Meta-heuristic anomaly assessments are important in the exploratory analysis of intrusion-related network transaction data. These assessments are needed to make and deliver predictions of intrusion possibility based on the available attribute details involved in the network transaction. We utilized the NSL-KDD data set for the binary and multiclass problems with a 20% testing dataset. This paper develops a new hybrid model that can be used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data made available for training. The experimental results revealed that the hybrid approach had a significant effect on minimising the computational and time complexity involved in determining the feature association impact scale. The accuracy of the proposed model was measured as 99.81% and 98.56% for the binary-class and multiclass NSL-KDD data sets, respectively.
However, there are issues with obtaining high false positive and false negative rates. A hybrid approach with two main parts is proposed to address these issues. First, the data are filtered using the Vote algorithm with Information Gain, which combines the probability distributions of the base learners in order to select the important features that positively affect the accuracy of the proposed model. Next, the hybrid algorithm consists of the following classifiers: J48, Meta Pagging, RandomTree, REPTree, AdaBoostM1, DecisionStump and NaiveBayes. Based on the results obtained using the proposed model, we observe improved accuracy, a low false negative rate, and a low false positive rate.
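The Information Gain criterion used in the filtering step is the reduction in class entropy after conditioning on a feature. A minimal implementation for discrete features, independent of the Vote ensemble machinery described above:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v) for a discrete feature."""
    total = entropy(labels)
    for v, count in zip(*np.unique(feature, return_counts=True)):
        total -= (count / len(feature)) * entropy(labels[feature == v])
    return total
```

A feature identical to the class label attains the full class entropy as its gain, while an uninformative feature scores zero; thresholding this score gives a simple filter.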
Dimensionality Reduction (DR) is a pre-processing step that removes redundant features and noisy, irrelevant data in order to improve learning accuracy and reduce training time. Dimensionality reduction techniques have been proposed and implemented using feature selection and feature extraction methods. Principal Component Analysis (PCA) is one such technique, which reduces computation time for the learning process. This paper analyzes the most widely used feature extraction techniques, such as EMD and PCA, and feature selection techniques, such as correlation, LDA, and forward selection, with respect to performance and accuracy. These techniques are widely applied in deep neural networks for medical image diagnosis and are used to improve classification accuracy. Further, we discuss how dimensionality reduction is performed in deep learning.
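Of the feature selection techniques listed, forward selection is the easiest to sketch: greedily add the feature that most improves a fit score. The least-squares R² scorer and the synthetic data below are assumptions of the illustration, not any particular paper's setup.

```python
import numpy as np

def forward_selection(X, y, n_features):
    """Greedy forward selection: repeatedly add the feature that most
    improves the R^2 of an ordinary least-squares fit."""
    def r2(cols):
        A = np.column_stack([X[:, cols], np.ones(len(X))])  # with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return 1.0 - resid.var() / y.var()
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best = max(remaining, key=lambda j: r2(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy check: target depends only on features 0 and 3
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.05 * rng.normal(size=300)
selected = forward_selection(X, y, 2)
```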
•We propose two multi-label learning approaches with LIFT reduction.•The idea of fuzzy rough set attribute reduction is adopted in our approaches.•Sample selection improves the efficiency of feature dimension reduction.
In multi-label learning, since different labels may have distinct characteristics of their own, a multi-label learning approach with label-specific features, named LIFT, has been proposed. However, the construction of label-specific features may increase feature dimensionality, and a large amount of redundant information exists in the feature space. To alleviate this problem, a multi-label learning approach FRS-LIFT is proposed, which implements label-specific feature reduction with fuzzy rough sets. Furthermore, with the idea of sample selection, another multi-label learning approach, FRS-SS-LIFT, is also presented, which effectively reduces the computational complexity of label-specific feature reduction. Experimental results on 10 real-world multi-label data sets show that our methods can not only reduce the dimensionality of label-specific features when compared with LIFT, but also achieve satisfactory performance among some popular multi-label learning approaches.
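LIFT's label-specific feature construction can be sketched for a single label: cluster the positive and negative instances separately, then re-represent every instance by its distances to all the cluster centres. This simplified version omits the fuzzy-rough-set reduction and the sample selection that FRS-LIFT and FRS-SS-LIFT add; the clustering ratio and the tiny k-means are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means for the sketch (random-point init, fixed iterations)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)   # (n, k) distances
        lab = d.argmin(1)
        for j in range(k):
            if np.any(lab == j):                       # keep centre if empty
                C[j] = X[lab == j].mean(0)
    return C

def lift_features(X, y, ratio=0.1):
    """LIFT-style label-specific features for one label: distances of every
    instance to cluster centres of the positive and negative instances."""
    pos, neg = X[y == 1], X[y == 0]
    k = max(1, int(ratio * min(len(pos), len(neg))))
    centres = np.vstack([kmeans(pos, k, seed=1), kmeans(neg, k, seed=2)])
    return np.sqrt(((X[:, None, :] - centres[None]) ** 2).sum(-1))

# toy check: 60 instances, 5 raw features, one binary label
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(60, 5))
y_demo = np.repeat([0, 1], 30)
Z = lift_features(X_demo, y_demo)
```

With 30 positives and 30 negatives and ratio 0.1, three centres per side are built, so each instance gains six label-specific distance features.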
•A feature-reduced intrusion detection system has been proposed.•Pre-processing is done to balance rarely and frequently occurring attacks.•Feature reduction has been done on the basis of information gain and correlation.•A classifier based on an artificial neural network has been used.•Comparison with state-of-the-art methods has been done.
Rapid growth in internet and network technologies has led to a considerable increase in the number of attacks and intrusions. Detection and prevention of these attacks has become an important part of security. An intrusion detection system is one of the important ways to achieve high security in computer networks and is used to thwart different attacks. Intrusion detection systems suffer from the curse of dimensionality, which tends to increase time complexity and decrease resource utilization. As a result, it is desirable that the important features of the data be analyzed by the intrusion detection system to reduce dimensionality. This work proposes an intelligent system which first performs feature ranking on the basis of information gain and correlation. Feature reduction is then done by combining the ranks obtained from information gain and correlation using a novel approach to identify useful and useless features. These reduced features are then fed to a feed-forward neural network for training and testing on the KDD-99 dataset. Pre-processing of the KDD-99 dataset has been done to normalize the number of instances of each class before training. The system then behaves intelligently to classify test data into attack and non-attack classes. The aim of the feature-reduced system is to achieve the same degree of performance as a normal system. The system is tested on five different test datasets, and both individual and average results for all datasets are reported. Comparison of the proposed method with and without feature reduction is done in terms of various performance metrics. Comparisons with recent and relevant approaches are also tabulated. The results obtained for the proposed method are encouraging.
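The rank-combination step can be illustrated as follows: score every feature by information gain (here on a median-binarised copy) and by absolute correlation with the class, convert each score to a rank, and keep the features with the best average rank. The plain averaging rule is a simple stand-in for the paper's novel combination scheme, and the data are synthetic.

```python
import numpy as np

def combined_rank_selection(X, y, n_keep):
    """Merge information-gain and correlation rankings by average rank
    (a simple stand-in for the paper's combination rule)."""
    def entropy(v):
        _, c = np.unique(v, return_counts=True)
        p = c / c.sum()
        return float(-(p * np.log2(p)).sum())
    def info_gain(f):
        b = (f > np.median(f)).astype(int)      # binarise continuous feature
        ig = entropy(y)
        for v in (0, 1):
            mask = b == v
            if mask.any():
                ig -= mask.mean() * entropy(y[mask])
        return ig
    ig = np.array([info_gain(X[:, j]) for j in range(X.shape[1])])
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
    # rank 0 = best under each criterion; average the two ranks
    rank_ig = np.argsort(np.argsort(-ig))
    rank_corr = np.argsort(np.argsort(-corr))
    avg = (rank_ig + rank_corr) / 2.0
    return np.argsort(avg)[:n_keep]

# toy check: only feature 2 drives the class label
rng = np.random.default_rng(11)
X = rng.normal(size=(400, 5))
y = (X[:, 2] + 0.1 * rng.normal(size=400) > 0).astype(int)
selected = combined_rank_selection(X, y, n_keep=2)
```

The selected indices can then feed a feed-forward network in place of the full feature set.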