Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to the type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods for text categorization, the first termed maximum discrimination (MD). The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
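For concreteness, the standard binary-hypothesis measures referenced above, together with one natural multi-hypothesis generalization, can be written as follows. The pairwise-sum form of the multi-class measure is an illustrative assumption; the paper's exact JMH-divergence may be defined differently.

```latex
% Kullback-Leibler and Jeffreys divergences for discrete distributions p, q:
\[
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}, \qquad
J(p,q) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p).
\]
% Illustrative multi-hypothesis generalization over K class-conditional
% distributions p_1, ..., p_K (an assumed pairwise-sum form):
\[
J_{\mathrm{MH}}(p_1,\dots,p_K) = \sum_{1 \le i < j \le K} J(p_i, p_j).
\]
```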
•A novel feature extraction method to reduce HSI dimensions, called IDA, is proposed.•IDA is applied to improve data correlation, distribution, and class variance.•The impact of feature extraction and preprocessing on HSI classification is studied.•Different datasets and well-known existing methods are used to test IDA.
Hyperspectral images (HSIs) are known for their high dimensionality and wide spectral bands, which increase redundant information and complicate classification. Outliers and mixed data are common problems in HSIs. Thus, preprocessing methods are essential for reducing data complexity, redundant information, and the number of bands. This study introduces a novel feature reduction method (FRM) called improving distribution analysis (IDA). IDA works to increase the correlation between related data, decrease the distance between large and small values, and correct each value's location so that it lies inside its group range. In IDA, the input data passes through three stages. The first removes outliers and improves data correlation. The second increases the variance. The third simplifies the data and normalizes the distribution. IDA is compared with four popular FRMs on four publicly available HSIs. It is also tested and evaluated in various classification models, including spatial, spectral, and spectral-spatial models. The experimental results demonstrate that IDA performs admirably in enhancing data distribution, reducing complexity, and accelerating performance.
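A minimal sketch of a three-stage pipeline in the spirit of IDA is given below. The concrete operations (percentile clipping, standardization, Yeo-Johnson transform) are illustrative assumptions, since the abstract does not specify IDA's internal formulas.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

def ida_like_pipeline(X, lower_pct=1.0, upper_pct=99.0):
    """X: (n_pixels, n_bands) matrix of hyperspectral reflectances."""
    # Stage 1 (assumed): suppress outliers by clipping each band to
    # percentile bounds, tightening the spread within each band.
    lo = np.percentile(X, lower_pct, axis=0)
    hi = np.percentile(X, upper_pct, axis=0)
    X = np.clip(X, lo, hi)
    # Stage 2 (assumed): rescale bands so variance is not dominated
    # by a few high-magnitude bands.
    X = StandardScaler().fit_transform(X)
    # Stage 3 (assumed): map each band toward a Gaussian-like distribution.
    X = PowerTransformer(method="yeo-johnson").fit_transform(X)
    return X
```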
The correct identification of individuals through different biometric traits is becoming increasingly important. Apart from traditional biomarkers (like fingerprints), many alternative measures have been proposed during the last two decades: electrocardiogram (ECG) and electroencephalogram (EEG) signals, iris or facial recognition, behavioral traits, etc. Several works have shown that ECG-based recognition is a feasible alternative, either for stand-alone or multi-biometric recognition systems. In this paper, we propose a novel framework for ECG-based biometric identification, consisting of a simple and robust feature extraction approach and a clustering-based feature reduction method, that enables efficient and scalable biometric identification. The proposed feature reduction approach is a two-phase method: it first uses a clustering algorithm to group features according to their similarities, and then each cluster is represented by a prototype vector and associated with the available subjects. In turn, the proposed time-domain feature extraction method is a semi-fiducial procedure, where the well-known Pan–Tompkins algorithm is first used to detect the R-wave peaks of the QRS complexes, and then fixed-width time segments are selected for further dimensionality reduction and feature extraction. The resulting combined methods are efficient, robust, and scalable, and attain excellent results (with up to 98.6% sensitivity) on all the subjects of the Physikalisch-Technische Bundesanstalt (PTB) database, regardless of their pathological or healthy status. Additionally, we also show how the existing Auto Correlation/Discrete Cosine Transform (AC/DCT)-based non-fiducial feature extraction method can be integrated within our framework, allowing us to attain up to 90.6% sensitivity on the Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH) arrhythmia database. Since this database is much noisier and has a much lower sampling rate (360 Hz instead of 1000 Hz), we consider this a very good result.
•We propose a novel efficient framework for ECG-based biometric identification.•A clustering-based classifier is used to reduce the computational/storage cost.•Hierarchical agglomerative clustering (HAC) is used to build the clusters.•A novel semi-fiducial time-domain feature extraction method is proposed.•Statistical analysis of P-QRS-T complexes is performed on MIT-BIH arrhythmia and PTB.
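As a sketch of the semi-fiducial segmentation step described above, the following uses scipy's generic peak detector as a stand-in for the full Pan–Tompkins detector (bandpass, derivative, squaring, moving-window integration); the window length and peak constraints are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def segment_heartbeats(ecg, fs=1000, win_ms=700):
    """ecg: 1-D signal; fs: sampling rate in Hz (PTB records use 1000 Hz)."""
    # Crude R-peak detection: peaks must be prominent and at least
    # 400 ms apart (a physiologically plausible refractory period).
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs),
                          prominence=np.std(ecg))
    half = int(win_ms / 1000 * fs) // 2
    # Cut a fixed-width window centered on each detected R peak.
    segments = [ecg[p - half:p + half] for p in peaks
                if p - half >= 0 and p + half <= len(ecg)]
    return np.vstack(segments) if segments else np.empty((0, 2 * half))
```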
► This paper proposes a desirable IDS model with high efficiency and accuracy. ► It formulates a pipeline of machine learning methods, including the k-means algorithm, ant colony optimization (ACO) and SVM. ► The accuracy achieves 98.6249%, and the average Matthews correlation coefficient (MCC) achieves 0.861161.
The efficiency of intrusion detection mainly depends on the dimension of the data features. Using a gradual feature removal method, 19 critical features are chosen to represent the various network visits. By combining a clustering method, the ant colony algorithm and a support vector machine (SVM), an efficient and reliable classifier is developed to judge whether a network visit is normal or not. Moreover, the accuracy achieves 98.6249% in 10-fold cross-validation, and the average Matthews correlation coefficient (MCC) achieves 0.861161.
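A hedged sketch of one way the gradual feature removal step could work is shown below: plain backward elimination driven by cross-validated SVM accuracy. The paper's exact removal criterion is not given in the abstract, so this ordering rule is an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def gradual_feature_removal(X, y, n_keep=19, cv=10):
    """Drop one feature at a time until n_keep features remain."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        scores = []
        for f in kept:
            trial = [g for g in kept if g != f]
            acc = cross_val_score(SVC(), X[:, trial], y, cv=cv).mean()
            scores.append((acc, f))
        # Remove the feature whose absence gives the best accuracy.
        _, worst = max(scores)
        kept.remove(worst)
    return kept
```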
•Human emotion recognition plays an essential role in human-computer interactions.•Deep time-frequency features were extracted from the modified Stockwell transform of each channel of EEG signals.•Fusion was applied to the reduced features obtained from a semi-supervised dimension reduction algorithm.•Inception-V3 for deep feature extraction and SVM for classification yield the highest accuracy on the DEAP and SEED datasets.
In recent years, human emotion recognition has received great attention since it plays an essential role in human-computer interactions. Traditional methods focused on electroencephalogram (EEG) analysis in either the time or the frequency domain are unsuitable since EEG signals are nonlinear. This paper proposes subject-independent human emotion recognition from multi-channel EEG signals. The proposed method first obtains the time-frequency content of each channel using the modified Stockwell transform and then extracts deep features from each time-frequency content with a deep convolutional neural network (CNN). Since the number of deep features is huge, semi-supervised dimension reduction (SSDR) is utilized to reduce them, and the reduced features of all channels are fused to construct the final feature vector. Several CNNs and classifiers are examined for deep feature extraction and classification, respectively. Classification in the two-class and four-class scenarios on the DEAP dataset and the three-class scenario on the SEED dataset shows that the Inception-V3 CNN with a support vector machine (SVM) classifier yields the highest accuracy. We present extensive simulations to demonstrate the efficiency of the proposed method. A performance comparison with current methods based on time-frequency analysis also demonstrates that the proposed method outperforms the others.
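The deep-feature stage can be sketched as follows: each channel's time-frequency map is treated as an image, passed through a pretrained Inception-V3, and the pooled features are fed to an SVM. The Stockwell-transform and SSDR steps are omitted; the input preprocessing and shapes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import inception_v3, Inception_V3_Weights
from sklearn.svm import SVC

weights = Inception_V3_Weights.DEFAULT
model = inception_v3(weights=weights)
model.fc = nn.Identity()   # expose the 2048-d pooled features
model.eval()

def deep_features(tf_maps):
    """tf_maps: float tensor (N, 3, 299, 299) of time-frequency images,
    assumed already normalized as the pretrained weights expect."""
    with torch.no_grad():
        return model(tf_maps).numpy()

# Illustrative usage: feats = deep_features(batch); SVC().fit(feats, labels)
```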
•To analyze the workload of fighter pilots in a realistic fighter flight simulator environment by monitoring their heart rate variability (HRV) parameters.•To analyze the workload of fighter pilots in a realistic high-fidelity fighter flight simulator environment by monitoring their EEG parameters.•Analyzing the statistically significant features of EEG and HRV.•Classification of pilots' cognitive workload from physiological signals using a machine learning (SVM, kNN and LDA) approach.
In general, fighter pilots are required to engage themselves fully during air-to-air combat operations in the cockpit of a fighter aircraft. Their performance has to be monitored continuously by classifying their cognitive workload levels during the different phases of flying. To this end, an experimental study was conducted in a realistic high-fidelity flight simulator environment to classify the Pilots' Cognitive Workload (PCWL) level. For real-time classification of PCWL during the takeoff, cruise and landing phases, physiological signals such as the ECG and EEG of the fighter pilots are used. Classification algorithms such as the Linear Discriminant Analysis (LDA) classifier, the Support Vector Machine (SVM) classifier and the k-Nearest Neighbour (k-NN) classifier have been employed. The results show that the takeoff (LDA – 75%, kNN – 60% and SVM – 75%) and landing (LDA – 75%, kNN – 60% and SVM – 75%) phases were better classified by HRV features, while the cruise phase was better classified by EEG features (LDA – 72.44%, kNN – 62.92% and SVM – 59.02%), when the PCA feature reduction technique was adopted. Using significant features obtained by feature selection methods (PCA, statistically significant features) showed improved classification accuracy compared to classification with all features. LDA and SVM are consistent classifiers compared to the kNN classifier. This study helps to classify the PCWL level at each flying phase as task demands increase.
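The reported comparison can be sketched as a PCA-plus-classifier pipeline; the HRV/EEG feature matrices and the number of retained components are assumptions, since the abstract does not state them.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y, n_components=10, cv=5):
    """X: (n_trials, n_features) HRV or EEG feature matrix; y: workload labels."""
    results = {}
    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("kNN", KNeighborsClassifier()),
                      ("SVM", SVC())]:
        # PCA feature reduction feeding each classifier, as in the study.
        pipe = make_pipeline(PCA(n_components=n_components), clf)
        results[name] = cross_val_score(pipe, X, y, cv=cv).mean()
    return results
```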
Feature selection techniques allow us to choose a small subset of relevant features from the original set by removing irrelevant or redundant features. Feature selection is essential for many reasons, such as simplification, performance, computational efficiency, and quality of interpretability. Owing to this importance, many researchers have proposed and developed algorithms to solve the feature selection problem. Although these approaches produce useful results, they have shortcomings such as inadequate feature reduction. In this paper, a novel feature selection algorithm based on the crow search algorithm is presented. The algorithm uses a dynamic awareness probability to keep the balance between the local and global search processes. Moreover, a novel neighborhood assignment strategy is introduced to optimize the local search. Considering the best selected features in each iteration helps attain more benefit in the global search. The main superiority of the proposed algorithm is its significant feature reduction along with retained accuracy. Compared to the enhanced crow search algorithm, the proposed algorithm improves the feature reduction metric and the fitness metric by 27.12% and 5.16%, respectively, while losing only 0.53% on the accuracy metric. Several popular UCI datasets were employed to evaluate the proposed feature selection algorithm. The experimental results show that the proposed algorithm outperforms other state-of-the-art feature selection algorithms regarding feature reduction and accuracy.
•Improving the balance between local and global search.•Introducing a new neighborhood concept to improve the local search.•Proposing a new method for more purposeful search during exploration.•Increasing the convergence rate using chaos.•Pioneering the reduction of dataset volume.
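A compact sketch of binary crow-search feature selection with a dynamic awareness probability (AP) follows. The linear AP decay, sigmoid transfer function, and kNN wrapper fitness are assumptions, as the paper's neighborhood strategy and exact AP schedule are not detailed in the abstract.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fs_fitness(mask, X, y, alpha=0.99):
    # Wrapper fitness: weighted mix of kNN accuracy and subset compactness.
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return alpha * acc + (1 - alpha) * (1 - mask.mean())

def binary_csa(X, y, n_crows=20, iters=50, fl=2.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.random((n_crows, d)) < 0.5           # binary positions
    mem = pos.copy()                               # each crow's best-known position
    mem_fit = np.array([fs_fitness(m, X, y) for m in mem])
    for t in range(iters):
        ap = 0.3 * (1 - t / iters)                 # dynamic AP: explore early
        for i in range(n_crows):
            j = rng.integers(n_crows)              # crow i tries to follow crow j
            if rng.random() >= ap:
                step = fl * rng.random(d) * (mem[j].astype(float)
                                             - pos[i].astype(float))
                pos[i] = rng.random(d) < 1 / (1 + np.exp(-step))  # sigmoid transfer
            else:
                pos[i] = rng.random(d) < 0.5       # crow j noticed: random relocation
            f = fs_fitness(pos[i], X, y)
            if f > mem_fit[i]:
                mem[i], mem_fit[i] = pos[i].copy(), f
    return mem[np.argmax(mem_fit)]                 # best feature mask found
```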
Recently developed imbalanced data classification models are mainly focused on the majority class samples. In addition, several whale-optimization-based feature reduction models are inefficient for high-dimensional data, readily fall into local optima, and have difficulty achieving a globally optimal feature subset due to high costs. To overcome these drawbacks, in this study, a two-stage feature reduction model using fuzzy neighborhood rough sets (FNRS) and the binary whale optimization algorithm (BWOA) was developed for imbalanced data classification. First, to capture the sample fuzziness of mixed data, a similarity measure between samples based on fuzzy neighborhoods was defined to construct the similarity matrix and fuzzy neighborhood granules, and a new FNRS model was presented by constructing lower and upper approximations. Considering the uneven distribution of classes, a boundary-based feature significance measure was developed to minimize the influence of uncertainty in the boundary region for imbalanced data. Second, fuzzy neighborhood roughness and decision entropy were investigated based on FNRS, and by integrating these measures, fuzzy neighborhood decision entropy was proposed to evaluate the fuzziness and roughness of the fuzzy neighborhood for imbalanced data. External and internal significance metrics were proposed to obtain the preselected feature subset in the first stage. Third, in the second stage, a new control factor was defined to control the position of the whales, and a novel fitness function was developed to evaluate the feature subsets selected from imbalanced datasets. Thereafter, the immune regulation strategy of artificial immune systems was introduced into the BWOA to design a mixed selection probability used to divide the whale population. Two local interference strategies were applied to adjust the whale positions and prevent the BWOA from being trapped in local optima. Thus, an optimal feature subset was achieved by iterating the BWOA. Finally, a two-stage feature reduction algorithm was designed to handle imbalanced, high-dimensional data, where the particle swarm optimization (PSO) algorithm was employed to determine the optimized parameters of the two-stage algorithm. Experiments conducted on 22 datasets revealed that the proposed algorithm is efficient for both two-class and multiclass datasets.
•A boundary-based feature significance measure was developed to minimize the influence of uncertainty in the boundary region.•Fuzzy neighborhood decision entropy was proposed to evaluate the fuzziness and roughness of imbalanced data.•A fitness function was developed to evaluate the feature subsets selected from imbalanced data.•A two-stage feature reduction algorithm was designed to handle imbalanced and high-dimensional data.
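The second-stage fitness function can be sketched as a weighted trade-off between classification quality on imbalanced data and subset size. The use of balanced accuracy, the weights, and the base classifier here are assumptions, since the abstract does not give the formula.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def bwoa_fitness(mask, X, y, alpha=0.9):
    """mask: boolean feature-selection vector proposed by one whale."""
    if not mask.any():
        return 0.0
    # Balanced accuracy (assumed) reflects the imbalanced-data focus better
    # than plain accuracy, which a majority-class predictor can inflate.
    score = cross_val_score(DecisionTreeClassifier(), X[:, mask], y,
                            cv=3, scoring="balanced_accuracy").mean()
    size_penalty = mask.sum() / mask.size
    return alpha * score + (1 - alpha) * (1 - size_penalty)
```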
•A new XGBoost prediction model for real-time SSC prediction in laser welding.•A 68.6% increase rate of the new model's R2, from 0.2947 to 0.9383.•Li I at 395.09 nm shows the highest importance, followed by Al I at 669.84 nm, Mg I at 518.4 nm and Ar I.•A good linear and positive correlation between the spectrum intensity of Mg I (517.27 nm) and the seam strength coefficient.
This paper studies the regression prediction of the laser welding seam strength of the aluminum-lithium alloy used in rocket storage tanks by means of the optical spectrum and the extreme gradient boosting decision tree (XGBoost). First, the relationship between the spectrum intensity and the seam strength coefficient is thoroughly investigated through parameter-changing experiments using the developed optical spectrum monitoring system. Then, the importance of the metal line spectra, including Al I, Li I, and Mg I, is quantitatively evaluated, and good complementarity between Random Forest (RF) and Principal Component Analysis (PCA) is demonstrated. Finally, a novel regression model, RFPCA-XGBoost, is proposed and compared with other feature selection methods, tree-based ensemble learning models and grid-search parameter optimization. The comparison results show that among all the methods, the proposed model has the best performance regarding the R2 value, achieving an R2 of 0.9383.
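A sketch of an RF-plus-PCA pipeline feeding an XGBoost regressor, in the spirit of the RFPCA-XGBoost model above, follows. How the paper actually fuses RF importances with PCA components is not detailed in the abstract, so the pre-select-then-decorrelate order here is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def rf_pca_xgboost(X, y, n_keep=20, n_components=10):
    """X: (n_welds, n_spectral_features) line intensities; y: seam strength."""
    # Rank spectral features by Random Forest importance, keep the top ones.
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[-n_keep:]
    # Decorrelate the retained features with PCA, then regress with XGBoost.
    Z = PCA(n_components=n_components).fit_transform(X[:, top])
    model = XGBRegressor(n_estimators=300, learning_rate=0.05)
    return cross_val_score(model, Z, y, cv=5, scoring="r2").mean()
```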