Time series feature engineering is time-consuming because scientists and engineers must consider the many algorithms of signal processing and time series analysis for identifying and extracting meaningful features from time series. The Python package tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests) accelerates this process by combining 63 time series characterization methods, which by default compute a total of 794 time series features, with feature selection based on automatically configured hypothesis tests. By identifying statistically significant time series characteristics at an early stage of the data science process, tsfresh closes feedback loops with domain experts and fosters the development of domain-specific features early on. The package implements standard APIs of time series and machine learning libraries (e.g. pandas and scikit-learn) and is designed for both exploratory analyses and straightforward integration into operational data science applications.
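The extract-then-select idea described above can be illustrated with a minimal, standard-library sketch. It does not use tsfresh or reproduce its API: the function names `extract_basic_features` and `select_relevant`, the four features, and the toy data are all hypothetical, and a simple permutation test on the difference of class means stands in for the package's automatically configured hypothesis tests.

```python
import random
import statistics
from collections import defaultdict


def extract_basic_features(rows):
    """Map (series_id, time, value) rows to {series_id: {feature_name: value}}."""
    series = defaultdict(list)
    for series_id, _time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        series[series_id].append(value)
    features = {}
    for series_id, values in series.items():
        changes = [abs(b - a) for a, b in zip(values, values[1:])]
        features[series_id] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),
            "max": max(values),
            "mean_abs_change": statistics.fmean(changes) if changes else 0.0,
        }
    return features


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means (stand-in test)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(xs)], pooled[len(xs):]
        # Small tolerance so that ties with the observed statistic count as hits.
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def select_relevant(features, labels, alpha=0.05):
    """Keep the features whose values differ significantly between the two classes."""
    names = next(iter(features.values())).keys()
    selected = []
    for name in names:
        group0 = [f[name] for sid, f in features.items() if labels[sid] == 0]
        group1 = [f[name] for sid, f in features.items() if labels[sid] == 1]
        if permutation_p_value(group0, group1) < alpha:
            selected.append(name)
    return selected


# Hypothetical toy data: class 0 oscillates weakly, class 1 strongly, both around zero.
rows, labels = [], {}
for i in range(5):
    for label, amplitude in ((0, 0.1 * (i + 1)), (1, 2.0 + 0.1 * i)):
        series_id = f"s{label}_{i}"
        labels[series_id] = label
        rows += [(series_id, t, amplitude * (-1) ** t) for t in range(6)]

features = extract_basic_features(rows)
selected = select_relevant(features, labels)
print(selected)
```

In this toy setup every series has a mean of zero, so the mean carries no class information and is filtered out, while the spread-related features pass the test. A real workflow would also correct for testing many features at once, which this sketch omits.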
•Characterizing high entropy alloys with machine learning and feature engineering.
•Augmenting the dimensionality by non-linear combinations of original descriptors.
•Linear machine learning model shows better generalization performance with non-linear descriptors.
The prediction of the phase formation of high entropy alloys (HEAs) has attracted great research interest in recent years owing to the superior structural and mechanical properties of single-phase alloys. However, identifying these single-phase solid solution alloys remains a challenge. Previous studies mainly rely on trial-and-error experiments or thermodynamic criteria; the former is time-consuming, the latter depends on the quality of the descriptors, and neither provides reliable predictions. In this study, we attempt to predict phase formation using feature engineering and machine learning (ML) with a small dataset. To characterize HEAs, the descriptor set is augmented from its original low dimension to a high dimension by non-linear combinations of the original descriptors. The results show that this method achieves higher accuracy in predicting the phase formation of HEAs than traditional methods. Beyond the prediction of HEAs, the method can also be applied to other materials with limited datasets.
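One way to picture the dimensionality augmentation is a degree-2 polynomial expansion, sketched below. The abstract does not state which non-linear combinations were used, so the choice of squares and pairwise products is an assumption, and the descriptor names and values (`delta`, `dHmix`, `VEC`) are hypothetical placeholders.

```python
from itertools import combinations


def augment_descriptors(sample):
    """Return the original descriptors plus their squares and pairwise products.

    `sample` maps descriptor names to floats. This is only one possible
    family of non-linear combinations (an assumption, not the paper's recipe).
    """
    augmented = dict(sample)
    for name, value in sample.items():
        augmented[f"{name}^2"] = value * value
    for (name_a, a), (name_b, b) in combinations(sample.items(), 2):
        augmented[f"{name_a}*{name_b}"] = a * b
    return augmented


# Hypothetical alloy with three illustrative descriptors.
alloy = {"delta": 4.0, "dHmix": -8.0, "VEC": 7.5}
expanded = augment_descriptors(alloy)
print(len(alloy), "->", len(expanded), "descriptors")
```

With n original descriptors this expansion yields n + n + n(n-1)/2 inputs, so a linear model fitted on the expanded set can represent quadratic interactions between the original descriptors.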
•This paper reviews building load prediction with machine learning techniques.
•Review and technical papers are searched by the Sub-keyword Synonym Searching method.
•Technical papers are reviewed in terms of application, algorithms, and data.
•Primary limitations and gaps are identified; future trends are predicted.
•Guidance for future technical papers on building load prediction is proposed.
The surge of machine learning and increasing data accessibility in buildings provide great opportunities for applying machine learning to building energy system modeling and analysis. Building load prediction is one of the most critical components of many building control and analytics activities, as well as of grid-interactive and energy-efficient building operation. While a large number of research papers exist on machine-learning-based building load prediction, a comprehensive review from the perspective of machine learning is missing. In this paper, we review the application of machine learning techniques in building load prediction, organized according to the logic of machine learning itself: performing a task T, evaluated by a performance measure P, based on learning from experience E.
First, we review the applications of building load prediction models (task T). Then, we review the modeling algorithms that improve machine learning performance and accuracy (performance P). Throughout the papers, we also review the literature from the perspective of the data used for modeling (experience E), including data engineering from the sensor level to the data level, pre-processing, and feature extraction and selection. Finally, we conclude with a discussion of well-studied and relatively unexplored fields as a reference for future research. We also identify the gaps in current machine learning applications and predict future trends and developments.
Credit card transaction fraud costs card issuers billions of dollars every year. A well-developed fraud detection system with a state-of-the-art fraud detection model is regarded as essential to reducing fraud losses. The main contribution of our work is the development of a fraud detection system that employs a deep learning architecture together with an advanced feature engineering process based on homogeneity-oriented behavior analysis (HOBA). Based on a real-life dataset from one of the largest commercial banks in China, we conduct a comparative study to assess the effectiveness of the proposed framework. The experimental results illustrate that our proposed methodology is an effective and feasible mechanism for credit card fraud detection. From a practical perspective, our proposed method can identify relatively more fraudulent transactions than the benchmark methods under an acceptable false positive rate. The managerial implication of our work is that credit card issuers can apply the proposed methodology to efficiently identify fraudulent transactions to protect customers’ interests and reduce fraud losses and regulatory costs.
•LSTM is considered for flood susceptibility prediction in a sequence perspective.
•An appropriate feature engineering method is integrated with the LSTM network.
•A reliable flood susceptibility map can be obtained by using the LSS-LSTM method.
•The proposed method can achieve better performance than benchmark methods.
Identifying floods and producing flood susceptibility maps are crucial steps for decision-makers to prevent and manage disasters. Many studies have used machine learning models to produce reliable susceptibility maps. Nevertheless, most research ignores the importance of developing appropriate feature engineering methods. In this study, we propose a local spatial sequential long short-term memory neural network (LSS-LSTM) for flood susceptibility prediction in Shangyou County, China. The three main contributions of this study are summarized below. First, using the deep learning technique of LSTM for flood susceptibility prediction offers a new perspective. Second, we integrate an appropriate feature engineering method with LSTM to predict flood susceptibility. Third, we implement two optimization techniques, data augmentation and batch normalization, to further improve the performance of the proposed method. The LSS-LSTM method not only captures the attribution information of flood conditioning factors and the local spatial information of flood data, but also has powerful sequential modelling capabilities for dealing with the spatial relationship of floods. The experimental results demonstrate that the LSS-LSTM method achieves satisfactory prediction performance, with an accuracy of 93.75% and an area under the receiver operating characteristic (ROC) curve of 0.965.
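The abstract does not spell out how local spatial information is turned into LSTM input. The sketch below shows one plausible reading, under stated assumptions: each grid cell holds a vector of flood conditioning factors, and the cells in a square window around a target cell are read row by row into a sequence. The function name `local_sequence`, the window radius, the row-major ordering, and the edge handling (clamping to the border) are all assumptions for illustration, not the authors' procedure.

```python
def local_sequence(grid, row, col, radius=1):
    """Read the (2*radius+1)^2 neighbourhood of a cell as a sequence of factor vectors.

    `grid[r][c]` is a list of conditioning-factor values for that cell.
    Cells outside the grid are replaced by the nearest border cell.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    sequence = []
    for d_row in range(-radius, radius + 1):
        for d_col in range(-radius, radius + 1):
            r = min(max(row + d_row, 0), n_rows - 1)
            c = min(max(col + d_col, 0), n_cols - 1)
            sequence.append(list(grid[r][c]))
    return sequence


# Hypothetical 3x3 raster with a single factor per cell, numbered 0..8.
grid = [[[r * 3 + c] for c in range(3)] for r in range(3)]
centre = local_sequence(grid, 1, 1)
corner = local_sequence(grid, 0, 0)
print(len(centre), "time steps of", len(centre[0]), "factor(s) each")
```

Each returned sequence has one step per neighbouring cell and one feature per conditioning factor, which is the shape a sequence model such as an LSTM expects for a single sample.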
•A novel Feature Engineering solution for theft detection in Smart Grids is introduced.
•Demand data from more than 4000 households are used to benchmark the solution.
•Six different attack scenarios and five machine learning algorithms are examined.
•Gradient Boosting is deployed and shown to outperform previous fraud detection models.
•Effects of unforeseen, zero-day, and irregular attacks are examined.
Despite many potential advantages, Advanced Metering Infrastructures have introduced new ways to falsify meter readings and commit electricity theft. This study contributes a new model-agnostic, feature-engineering framework for theft detection in smart grids. The framework introduces a combination of Finite Mixture Model clustering for customer segmentation and a Genetic Programming algorithm for identifying new features suitable for prediction. Utilizing demand data from more than 4000 households, a Gradient Boosting Machine algorithm is applied within the framework, significantly outperforming the results of prior machine-learning, theft-detection methods. This study further examines some important practical aspects of deploying theft detection including: the detection delay; the required size of historical demand data; the accuracy in detecting thefts of various types and intensity; detecting irregular and unseen attacks; and the computational complexity of the detection algorithm.
Conventional machine learning approaches for predicting material properties from elemental compositions have emphasized the importance of leveraging domain knowledge when designing model inputs. Here, we demonstrate that by using a deep learning approach, we can bypass such manual feature engineering requiring domain knowledge and achieve much better results, even with only a few thousand training samples. We present the design and implementation of a deep neural network model referred to as ElemNet; it automatically captures the physical and chemical interactions and similarities between different elements, which allows it to predict material properties with better accuracy and speed. The speed and best-in-class accuracy of ElemNet enable us to perform a fast and robust screening for new material candidates in a huge combinatorial space, where we predict hundreds of thousands of chemical systems that could contain yet-undiscovered compounds.
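To make "predicting from elemental compositions without manual feature engineering" concrete, the sketch below encodes a composition as a fixed-length vector of element fractions, the kind of raw input a network could learn from directly. It is an illustration under assumptions, not the ElemNet code: the function name `composition_vector` and the shortened element list are hypothetical.

```python
# Hypothetical, shortened element vocabulary; a real model would cover many more elements.
ELEMENTS = ["H", "Li", "O", "Na", "Cl", "Ti", "Fe"]


def composition_vector(composition, elements=ELEMENTS):
    """Encode {element: amount} as element fractions in a fixed element order."""
    unknown = set(composition) - set(elements)
    if unknown:
        raise ValueError(f"elements not in vocabulary: {sorted(unknown)}")
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition must contain a positive total amount")
    return [composition.get(element, 0) / total for element in elements]


# Fe2O3 expressed as atom counts; no hand-crafted physical descriptors are computed.
fe2o3 = composition_vector({"Fe": 2, "O": 3})
print(dict(zip(ELEMENTS, fe2o3)))
```

The vector contains only the fractions of each element, so any knowledge about how elements interact has to be learned by the model rather than supplied through designed descriptors.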
The struggle between security analysts and malware developers is a never-ending battle, with the complexity of malware changing as quickly as innovation grows. Current state-of-the-art research focuses on the development and application of machine learning techniques for malware detection because of their ability to keep pace with malware evolution. This survey aims to provide a systematic and detailed overview of machine learning techniques for malware detection and, in particular, deep learning techniques. The main contributions of the paper are: (1) it provides a complete description of the methods and features in a traditional machine learning workflow for malware detection and classification, (2) it explores the challenges and limitations of traditional machine learning, and (3) it analyzes recent trends and developments in the field with special emphasis on deep learning approaches. Furthermore, (4) it presents the research issues and unsolved challenges of the state-of-the-art techniques and (5) it discusses the new directions of research. The survey helps researchers understand the malware detection field and the new developments and directions of research explored by the scientific community to tackle the problem.
•It presents a systematic review of ML approaches for malware detection.
•Traditional approaches are classified into static, dynamic and hybrid approaches.
•It provides a detailed description of the features in a traditional ML workflow.
•It introduces new research directions such as deep learning and multimodal approaches.
•It discusses the research issues and challenges faced by security researchers.
Author profiling consists of extracting authors' demographic and psychographic information by examining their writings. This information can then be used to improve the reader experience and to detect bots or propagators of hoaxes and/or hate speech. Therefore, author profiling can be applied to build more robust and efficient Knowledge-Based Systems for tasks such as content moderation, user profiling, and information retrieval. Author profiling is typically performed automatically as a document classification task. Recently, language models based on transformers have also proven to be quite effective in this task. However, the size and heterogeneity of novel language models make it necessary to evaluate them in context. The contributions we make in this paper are four-fold. First, we evaluate which language models are best suited to perform author profiling in Spanish. These experiments include basic, distilled, and multilingual models. Second, we evaluate how feature integration can improve performance for this task. We evaluate two distinct strategies: knowledge integration and ensemble learning. Third, we evaluate the ability of linguistic features to improve the interpretability of the results. Fourth, we evaluate the performance of each language model in terms of memory, training, and inference times. Our results indicate that lightweight models can indeed achieve performance similar to that of heavy models and that multilingual models are actually less effective than models trained on one language. Finally, we confirm that the best models and strategies for integrating features ultimately depend on the context of the task.
•Study of large language models for conducting Author Profiling in Spanish.
•Feature integration improves the performance of large language models.
•Interpretability of profiles using linguistic features.
•Hyperlinks and hashtags are strong features when discerning between bots and humans.
•Stylometry is the linguistic category most relevant in Author Profiling.
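Of the two feature integration strategies named in the abstract, ensemble learning can be sketched briefly. The abstract does not say which ensembling scheme was used, so the example below assumes soft voting, that is, averaging the class probabilities produced by separate models. The function name `soft_vote`, the two model outputs, and their probability values are hypothetical.

```python
def soft_vote(model_probabilities):
    """Average per-class probabilities across models and return (best_label, averages).

    `model_probabilities` is a list of {label: probability} dicts, one per model,
    all sharing the same labels.
    """
    labels = list(model_probabilities[0])
    n_models = len(model_probabilities)
    averages = {
        label: sum(probs[label] for probs in model_probabilities) / n_models
        for label in labels
    }
    return max(averages, key=averages.get), averages


# Hypothetical outputs: a transformer-based classifier and a linguistic-feature classifier.
transformer_output = {"bot": 0.9, "human": 0.1}
linguistic_output = {"bot": 0.4, "human": 0.6}
label, averages = soft_vote([transformer_output, linguistic_output])
print(label, averages)
```

Here the two models disagree, and the averaged probabilities decide the final label; weighting the models unequally would be a natural variation.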