Width deviation is an important metric for evaluating the quality of a hot-rolled strip in steel production systems. This paper considers a width deviation prediction problem and proposes a Machine-learning and Genetic-algorithm-based Hybrid method named MGH to obtain a prediction model. Existing work mainly focuses on high prediction accuracy while ignoring interpretability. This work aims to build a prediction model that strikes a good trade-off between two industry-required criteria, i.e., prediction accuracy and interpretability. It first collects process variables from a hot rolling process and places them, together with some constructed variables, in a feature pool. MGH then finds representative variables in this pool and builds a prediction model. MGH integrates hierarchical clustering, a genetic algorithm, and generalized linear regression. In detail, hierarchical clustering divides the variables into clusters, and the genetic algorithm and generalized linear regression are combined to select a representative variable from each cluster and develop a prediction model. Computational experiments on both industrial and public datasets show that the proposed method effectively balances the prediction accuracy and interpretability of its resulting model and has better overall performance than the compared state-of-the-art models.
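The idea of choosing one representative variable per cluster with a genetic algorithm can be illustrated with a minimal sketch. It assumes the clusters have already been produced by hierarchical clustering, and it treats the fitness function as a placeholder callable; in the method described above the score would presumably come from a fitted generalized linear regression, which is not reproduced here. The names `decode`, `evolve`, and `fitness`, the variable names, and all parameter values are hypothetical.

```python
import random

def decode(chromosome, clusters):
    """Map one gene per cluster to the variable name that gene selects."""
    return [cluster[gene] for gene, cluster in zip(chromosome, clusters)]

def evolve(clusters, fitness, generations=30, pop_size=20, rng=None):
    """Toy genetic algorithm: each chromosome picks exactly one variable per cluster.

    `fitness` is a hypothetical callable that scores a list of variable names
    (higher is better). It stands in for a model-based score.
    """
    rng = rng or random.Random(0)
    population = [[rng.randrange(len(c)) for c in clusters] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half unchanged (simple elitism).
        population.sort(key=lambda ch: fitness(decode(ch, clusters)), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            k = rng.randrange(len(child))
            child[k] = rng.randrange(len(clusters[k]))        # re-draw one gene
            children.append(child)
        population = parents + children
    best = max(population, key=lambda ch: fitness(decode(ch, clusters)))
    return decode(best, clusters)

# Hypothetical clusters of process variables and a stand-in fitness function.
clusters = [["temp_a", "temp_b"], ["speed"], ["width_in", "width_out", "width_mid"]]
chosen = evolve(clusters, fitness=lambda names: sum(len(n) for n in names))
print(chosen)
```

Because every gene indexes into its own cluster, any chromosome decodes to exactly one variable per cluster, so the selected set stays small and non-redundant by construction.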
Feature selection is an effective data pre-processing technique that aims to select useful features. This technique has been widely applied in machine learning and data mining to improve the performance of learning models in downstream tasks. However, traditional feature selection approaches face two issues: (1) ignoring the essential interaction between features, and (2) sacrificing the high-level information hidden in features. Thus, the fuzzy technique is applied to frame a unified architecture named fuzzy feature factorization machine (F3M). Essentially, F3M leverages a scheme of pairwise feature combination that serves as a basis for bridging feature interaction, selection and feature construction. Specifically, pairwise feature combination is used to exploit feature interaction for fuzzy criteria-based feature selection and induce fuzzy feature interaction information for high-level feature construction. It is worth noting that F3M is a general framework that allows the use of any feature selection procedure. The superiority of F3M is validated using 17 public datasets and 5 schizophrenia datasets. The experimental results demonstrate that F3M can identify features with better classification performance, can be effectively applied to noisy data, and can also contribute to the prediction of schizophrenia incidence.
•A novel concept of pairwise feature factorization is defined.
•A novel fuzzy measure criterion is presented.
•A novel feature construction strategy is proposed based on the selected features.
Feature construction has shown promise in improving the accuracy of crop classification by constructing high-level features. However, current feature construction methods often rely on domain knowledge and offer limited interpretability of their solutions. To address this, this study proposes a new genetic programming (GP) approach that automatically evolves highly interpretable solutions for constructing high-level features for crop classification from hyperspectral images. The approach uses a flexible multi-tree representation to construct various types of high-level features from the original ones simultaneously. To improve the search ability, a new offspring generation method is developed to dynamically guide the evolution of the population while improving its diversity. The new approach is wrapped with three classification algorithms, i.e., support vector machine (SVM), naive Bayes (NB), and k-nearest neighbour (KNN), for crop classification on three datasets with different difficulties and tasks. The results demonstrate that the features constructed by the new approach can effectively distinguish different crop categories. The new approach achieves better performance than the compared GP-based method, classic methods, and deep learning methods in crop classification using hyperspectral images. Importantly, the constructed features are highly interpretable.
•A method, FCSMI, is proposed to handle imbalance in multi-label datasets.
•The FCSMI method combines feature construction and the SMOTE method.
•FCSMI uses distances from minority instances as features for SMOTE.
•Experimental results confirm that FCSMI improves the classification performance.
Class imbalance is intrinsic to multi-label datasets because of the large number of labels, the few relevant labels in many instances, and the varied number of relevant instances for different labels. It biases multi-label learning toward majority instances for many labels. Therefore, it is essential to handle the imbalance of a multi-label dataset before applying any multi-label learning algorithm. A handful of proposals have addressed this problem in recent years, but it remains a significant challenge. This paper proposes a method termed Feature Construction and SMOTE-based Imbalance handling (FCSMI) for multi-label datasets. FCSMI works in a label-wise manner as follows. First, it determines whether the label is a minority label based on the mean imbalance ratio. Then, for minority labels, it calculates the distances of each instance from all the minority instances. These distances are used as features. Next, it uses SMOTE to balance the ratio between minority and majority instances. Finally, the dataset, which has a lower imbalance ratio than its initial counterpart, is used to train the classifier. The experimental results demonstrate the effectiveness of the proposed FCSMI method compared to the prevailing state-of-the-art multi-label sampling methods.
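The label-wise steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation. It assumes that a label counts as a minority label when its imbalance ratio (count of the most frequent label divided by the count of this label) exceeds the mean imbalance ratio, that every label has at least one positive instance, and that SMOTE is reduced to its core interpolation step between two minority instances. The function names are hypothetical.

```python
import math

def imbalance_ratios(Y):
    """Per-label ratio: count of the most frequent label / count of this label.

    Assumes every label column in Y has at least one positive instance.
    """
    counts = [sum(column) for column in zip(*Y)]
    top = max(counts)
    return [top / c for c in counts]

def minority_labels(Y):
    """Labels whose imbalance ratio exceeds the mean ratio (assumed criterion)."""
    ratios = imbalance_ratios(Y)
    mean_ratio = sum(ratios) / len(ratios)
    return [j for j, r in enumerate(ratios) if r > mean_ratio]

def distance_features(X, Y, label):
    """For each instance, its Euclidean distance to every minority instance of `label`."""
    minority = [x for x, y in zip(X, Y) if y[label] == 1]
    return [[math.dist(x, m) for m in minority] for x in X]

def smote_point(a, b, gap):
    """Core SMOTE step: a synthetic point on the segment from a to b (0 <= gap <= 1)."""
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

# Toy data: four instances, two labels; label 1 is rare.
X = [[0, 0], [1, 0], [0, 1], [3, 4]]
Y = [[1, 0], [1, 0], [1, 1], [1, 0]]
for label in minority_labels(Y):
    print(label, distance_features(X, Y, label))
```

In a full pipeline the distance features would be appended to the data before oversampling, with `gap` drawn at random for each synthetic instance and the neighbour `b` chosen among the nearest minority instances.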
•Genetic programming (GP) is the most suitable technique for feature construction. This paper investigates the key factors and how they influence the performance of different approaches to GP for multiple feature construction on high-dimensional data.
•In terms of representation, a multi-tree representation achieves better classification performance than a single-tree representation.
•In terms of evaluation, an appropriate combination of filter measures is more effective and efficient than a hybrid combination of wrapper and filter.
•In multi-tree GP for feature construction, the class-dependent constructed features achieved significantly better classification performance than the class-independent ones.
Data representation is an important factor in determining the performance of machine learning algorithms, including classification. Feature construction (FC) can combine original features to form high-level ones that can help classification algorithms achieve better performance. Genetic programming (GP) has shown promise in FC due to its flexible representation. Most GP methods construct a single feature, which may not scale well to high-dimensional data. This paper investigates different approaches to constructing multiple features and analyses their effectiveness, efficiency, and underlying behaviours to provide insight into multiple-feature construction using GP on high-dimensional data. The results show that multiple-feature construction achieves significantly better performance than single-feature construction. In multiple-feature construction, a multi-tree GP representation is shown to be more effective than a single-tree one, thanks to its ability to consider the interaction of the newly constructed features during the construction process. Class-dependent constructed features achieve better performance than the class-independent ones. A visualisation of the constructed features also demonstrates the interpretability of the GP-based FC approach, which is important to many real-world applications.
The modern intelligent power grid provides an efficient way of managing energy supply and consumption, but it also faces numerous security threats. Both natural and man-made events can cause power system disturbances. Therefore, it is important for operators to identify the specific causes and types of disturbance in the power system so that they can make decisions and respond appropriately. To address this problem, this paper proposes a machine-learning-based attack detection model for power systems that can be trained using information and logs collected by phasor measurement units (PMUs). We carry out feature construction engineering and then send the data to different machine learning models, in which random forest is chosen as the base classifier of AdaBoost. The model is evaluated using open-source simulated power system data, which consists of 37 power system event scenarios. Finally, we compare the proposed model with other models using different evaluation metrics. The experimental results demonstrate that this model achieves an accuracy of 93.91% and a detection rate of 93.6%, higher than eight recently developed techniques.
Feature construction and feature selection are two common pre-processing methods for classification. Genetic Programming (GP) can be used to solve feature construction and feature selection tasks due to its flexible representation. In this paper, a filter-based multiple feature construction approach using GP, named FCM, that stores top individuals is proposed, and a filter-based feature selection approach using GP, named FS, that uses a correlation-based evaluation method is employed. A hybrid feature construction and feature selection approach named FCMFS, which first constructs multiple features using FCM and then selects effective features using FS, is proposed. Experiments on nine datasets show that features selected by FS or constructed by FCM all improve the classification performance compared with the original features. The proposed FCMFS maintains the classification performance with a smaller number of features than FCM, and obtains better classification performance with a smaller number of features than FS on the majority of the nine datasets. Compared with another feature construction and feature selection approach named FSFCM, which first selects features using FS and then constructs features using FCM, FCMFS achieves better classification performance with a smaller number of features. Comparisons with three state-of-the-art techniques show that the proposed FCMFS approach achieves better experimental results in most cases.
The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as a means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: prediction of gender, personality type, age, news topics, drug side effects and drug effectiveness. The constructed semantic features, in combination with fast linear classifiers and tested against strong baselines such as hierarchical attention neural networks, achieve comparable classification results on short text documents. The algorithm’s performance is also tested in a few-shot learning setting, indicating that the inclusion of semantic features can improve the performance in data-scarce situations. The capability of tax2vec to extract corpus-specific semantic keywords is also demonstrated. Finally, we investigate the semantic space of potential features, where we observe a similarity with the well-known Zipf’s law.
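The general idea of taxonomy-based semantic features can be illustrated with a toy example: each token is mapped to its chain of hypernyms, and the hypernym counts become additional features. This sketch is not the tax2vec algorithm itself, which the abstract describes only at a high level. The taxonomy, the whitespace tokenisation, and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy taxonomy: term -> parent (hypernym).
TAXONOMY = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "drug": "substance",
    "nausea": "symptom",
    "headache": "symptom",
}

def hypernyms(word, taxonomy):
    """Walk up the taxonomy from `word` and return all ancestors, nearest first."""
    chain, seen = [], set()
    while word in taxonomy and word not in seen:  # `seen` guards against cycles
        seen.add(word)
        word = taxonomy[word]
        chain.append(word)
    return chain

def semantic_features(text, taxonomy):
    """Count how often each hypernym is reached from the tokens of `text`."""
    counts = Counter()
    for token in text.lower().split():
        counts.update(hypernyms(token, taxonomy))
    return counts

features = semantic_features("Aspirin eased my headache", TAXONOMY)
print(dict(features))
```

Two documents that mention different drugs share the `drug` and `substance` features even when they have no tokens in common, which is one way such features could help when training data is scarce.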
In recent years, genetic programming has achieved impressive results on evolutionary feature construction tasks. To increase search effectiveness, researchers have developed many semantic-based crossover and mutation operators to guide genetic programming searches toward the target semantics. However, semantics has not yet been explored for the hoist mutation operator, which is an operator designed for controlling the bloat effect. Although the hoist mutation operator can significantly reduce model sizes, the most informative subtree may be disrupted by the randomness in mutation. To address this issue, we develop a semantic-based hoist mutation operator in this paper to preserve the most informative subtree that has the largest cosine similarity between its semantics and the target semantics. Experimental results on 98 regression datasets from the Penn Machine Learning Benchmark show that using this operator not only significantly reduces model size, but also improves the test accuracy of features constructed by genetic programming. A comparison with seven bloat control methods shows that the proposed operator achieves the best trade-off between accuracy and model size. Moreover, an experiment on the state-of-the-art symbolic regression benchmark shows that genetic programming with the semantic-based hoist mutation operator achieves the best test accuracy and competitive model sizes compared with 22 symbolic regression and machine learning algorithms.
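The selection rule at the heart of this operator can be sketched as follows: evaluate every proper subtree on the training inputs, and keep the one whose output vector (its semantics) has the largest cosine similarity with the target vector. This is a simplified illustration under stated assumptions, not the authors' operator. Trees are represented as nested tuples, the whole tree is replaced by the chosen subtree, and the function names and the operator set are hypothetical.

```python
import math

# A tree is either a variable name ("x0") or a tuple (op, left, right).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def semantics(tree, rows):
    """Output vector of `tree` over all rows (each row maps variable -> value)."""
    if isinstance(tree, str):
        return [row[tree] for row in rows]
    op, left, right = tree
    return [OPS[op](a, b) for a, b in zip(semantics(left, rows), semantics(right, rows))]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def subtrees(tree):
    """Yield the tree itself, then every subtree below it."""
    yield tree
    if not isinstance(tree, str):
        for child in tree[1:]:
            yield from subtrees(child)

def semantic_hoist(tree, rows, target):
    """Replace `tree` with its proper subtree most similar to `target`."""
    candidates = [s for s in subtrees(tree) if s is not tree]
    if not candidates:  # a single leaf has nothing to hoist
        return tree
    return max(candidates, key=lambda s: cosine(semantics(s, rows), target))

rows = [{"x0": 1, "x1": 2}, {"x0": 2, "x1": 1}, {"x0": 3, "x1": 5}]
target = [2, 4, 6]  # proportional to x0
tree = ("add", ("mul", "x0", "x1"), "x0")
print(semantic_hoist(tree, rows, target))
```

A conventional hoist mutation would pick the subtree at random. Ranking the candidates by similarity to the target is what keeps the informative part of the tree while the overall size still shrinks.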
Variations in commands executed as part of the attack process can be used to determine the behavioural patterns of IoT attacks. Existing approaches rely on the domain knowledge of security experts to identify the behavioural patterns and to categorise and classify cyber attacks. We propose an Autoencoder (AE)-based feature construction approach that removes the dependency on manually correlating commands and generates an efficient representation by automatically learning the semantic similarity between input features extracted from command data. We applied three clustering algorithms, i.e., K-means, Gaussian Mixture Models and Density-based spatial clustering of applications with noise, to our data set of AE features. We discuss the clustering arrangements to understand the impact of changes in commands on the behavioural patterns of attacks and how attacks are grouped in the same or different clusters. Evaluation of our feature construction approach shows that the clustering algorithms grouped attacks with more common feature values than clustering with the original features did. Moreover, we performed a comparative analysis of two existing feature extraction approaches on our data set, considering the type of analysis in the process, the generalisability of the features, coverage of the data set, and clustering arrangements. We found that the challenges identified in applying existing approaches can be addressed with our proposed approach, and that improving features with AE provided meaningful clustering interpretations.
•The impact of changes in commands on behavioural patterns of IoT attacks.
•Autoencoder (AE)-based feature construction approach for clustering IoT attacks.
•Clustering of IoT attacks with AE features and a discussion of the clustering arrangements.
•Evaluation of AE features and original features for clustering attacks.
•Comparative analysis of the proposed approach with existing studies.