• The proposed soybean yield forecasting framework improves soybean yield prediction accuracy.
• Accurate and stable in-season yield predictions are achieved from the soybean pod-setting stage.
• Extensive assessment of the impact of multiple factors on soybean yield prediction.
• In-depth interpretation of soybean yield forecasts using feature importance analysis and SHAP.
Yield prediction is essential for food security, food trade, and field management. However, due to the complex formation mechanisms of yield, accurate and timely yield prediction remains challenging in remote sensing-based crop monitoring. In this study, a framework for soybean yield prediction integrating extreme gradient boosting (XGBoost) and multidimensional feature engineering was developed at the county level in the United States using publicly available datasets. Excellent accuracy was obtained for over 959 counties in 12 states throughout the midwestern U.S., with a test coefficient of determination (R2) of 0.82 and a root-mean-square error (RMSE) of 0.246 t/ha. Following a “train–validate–test” assessment strategy, our study shows that XGBoost outperforms other county-level soybean yield prediction models with identical inputs, including linear regression (LR), random forest (RF), k-nearest neighbor (KNN), artificial neural network (ANN), support vector regression (SVR), long short-term memory (LSTM), and deep neural network (DNN). The results show that accurate soybean yield predictions can be obtained as early as the pod-setting stage. We implemented the feature importance and Shapley additive explanations (SHAP) algorithms to quantify the impact of input features on the XGBoost model in the training and prediction stages, respectively. The enhanced vegetation index (EVI) at the pod-setting period is the most crucial factor, but the yield prediction does not depend on only a few key features. Yields were detrended using long-term historical yield data, and R2 increased from 0.58 to 0.82 while RMSE decreased from 0.374 t/ha to 0.246 t/ha. We employed multidimensional feature engineering to generate phenology-based features, and R2 improved from 0.79 to 0.82 while RMSE decreased from 0.268 t/ha to 0.246 t/ha.
The framework can be easily implemented and extended in the future in combination with early crop identification.
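The detrending step above (R2 rising from 0.58 to 0.82 once long-term yield trends are removed) can be sketched as a simple linear detrend per county. The function and toy data below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def detrend_yields(years, yields):
    """Remove the long-term (e.g. technology-driven) linear trend from a
    county's historical yield series, returning the anomaly that a model
    such as XGBoost would then learn from in-season features."""
    years = np.asarray(years, dtype=float)
    yields = np.asarray(yields, dtype=float)
    slope, intercept = np.polyfit(years, yields, deg=1)  # linear trend fit
    trend = slope * years + intercept
    anomaly = yields - trend
    return anomaly, trend

# Toy series: yields (t/ha) rising ~0.03 t/ha per year plus noise
years = np.arange(2000, 2020)
rng = np.random.default_rng(0)
yields = 2.5 + 0.03 * (years - 2000) + rng.normal(0, 0.1, years.size)
anomaly, trend = detrend_yields(years, yields)

# A predicted anomaly is converted back to an absolute yield by adding
# the trend value extrapolated to the target year.
print(float(anomaly.mean()))  # residual mean is ~0 by construction
```

Because the least-squares fit includes an intercept, the anomalies average to zero, so the model only has to explain year-to-year weather-driven variation.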
The present study aims at predicting the maximum temperature in line contacts depending on operating conditions. For this purpose, a thermo-elastohydrodynamic lubrication (TEHL) simulation model of a line contact is used to calculate the maximum temperature for a wide range of parameters. Subsequently, a neural network (NN) approach is used to develop a surrogate model that is able to predict the maximum temperature on the basis of the operational parameters. The influence of different NN architectures and transfer functions on the accuracy is shown. Good agreement, with a correlation coefficient (R) greater than 0.997, is achieved for an NN with two hidden layers. Furthermore, the impact of feature engineering on the prediction accuracy with limited data sets is presented.
• Local temperature detection in rolling contact bearings on the basis of TEHL simulations.
• Efficient temperature prediction with a neural network-based surrogate model.
• Influence of feature engineering on the performance of the neural network model.
• Impact of neural network architectures on the prediction accuracy.
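To make the surrogate idea concrete, here is a minimal NumPy sketch of a two-hidden-layer feed-forward pass with tanh transfer functions. The layer sizes, random weights, and the four named operating parameters are assumptions for illustration; a real surrogate would be trained on the TEHL simulation outputs:

```python
import numpy as np

def mlp_forward(x, weights, biases, transfer=np.tanh):
    """Forward pass of a fully connected surrogate network: the transfer
    function is applied on every hidden layer, the output stays linear."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = transfer(a @ W + b)              # hidden layers
    return a @ weights[-1] + biases[-1]      # linear output: max temperature

# Illustrative architecture: 4 operating parameters -> 16 -> 16 -> 1
rng = np.random.default_rng(42)
sizes = [4, 16, 16, 1]
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

# One normalised operating point (load, speed, viscosity, roughness — invented)
x = np.array([[0.3, 0.7, 0.5, 0.2]])
t_max = mlp_forward(x, weights, biases)
print(t_max.shape)  # (1, 1): one predicted maximum temperature
```

Swapping `transfer` (tanh, sigmoid, ReLU) and the entries of `sizes` is exactly the architecture/transfer-function study the abstract describes.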
Although predictive machine learning for supply chain data analytics has recently been reported as a significant area of investigation due to the rising popularity of the AI paradigm in industry, there is a distinct lack of case studies that showcase its application from a practical point of view. In this paper, we discuss the application of data analytics to predicting first-tier supply chain disruptions using historical data available to an Original Equipment Manufacturer (OEM). Our methodology includes three phases: first, an exploratory phase is conducted to select and engineer potential features that can act as useful predictors of disruptions; second, a performance metric is developed in alignment with the specific goals of the case study to rate successful methods; third, an experimental design is created to systematically analyse the success rate of different algorithms and algorithmic parameters on the selected feature space. Our results indicate that adding an engineered feature, namely agility, to the data outperforms the other experiments, leading to a final algorithm that can predict late orders with 80% accuracy. An additional contribution is the novel application of machine learning to predicting supply disruptions. Through the discussion and development of the case study, we hope to shed light on the development and application of data analytics techniques in the analysis of supply chain data. We conclude by highlighting the importance of domain knowledge for successfully engineering features.
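The abstract does not define how the agility feature is computed, so the sketch below is a hypothetical construction: agility as the buffer between an order's promised lead time and the supplier's historical average delivery time. All field names and data are invented for illustration:

```python
def agility_feature(orders):
    """Illustrative 'agility' score per order: promised lead time minus the
    supplier's historical average delivery time, in days. Negative values
    flag orders with no slack, i.e. likely to arrive late."""
    # Historical average delivery time per supplier
    hist = {}
    for o in orders:
        hist.setdefault(o["supplier"], []).append(o["actual_days"])
    avg = {s: sum(v) / len(v) for s, v in hist.items()}
    return [o["promised_days"] - avg[o["supplier"]] for o in orders]

orders = [
    {"supplier": "A", "promised_days": 10, "actual_days": 12},
    {"supplier": "A", "promised_days": 14, "actual_days": 12},
    {"supplier": "B", "promised_days": 7,  "actual_days": 6},
]
scores = agility_feature(orders)
# Supplier A averages 12 days, so a 10-day promise has agility -2.0
print(scores)  # [-2.0, 2.0, 1.0]
```

This is the kind of domain-knowledge-driven engineered feature the conclusion argues for: it encodes supplier behaviour that raw order fields do not expose to the classifier.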
Accurate wind power forecasting is essential for efficient operation and maintenance (O&M) of wind power conversion systems. Offshore wind power predictions are even more challenging due to the multifaceted systems and the harsh environment in which they are operating. In some scenarios, data from Supervisory Control and Data Acquisition (SCADA) systems are used for modern wind turbine power forecasting. In this study, a deep learning neural network was constructed to predict wind power based on a very high-frequency SCADA database with a sampling rate of 1 s. Input features were engineered based on the physical process of offshore wind turbines, while their linear and non-linear correlations were further investigated through Pearson product-moment correlation coefficients and the deep learning algorithm, respectively. Initially, eleven features were used in the predictive model: four wind speeds at different heights, three measured pitch angles of each blade, average blade pitch angle, nacelle orientation, yaw error, and ambient temperature. A comparison between different features showed that nacelle orientation, yaw error, and ambient temperature can be removed from the deep learning model. The simulation results showed that the proposed approach can reduce the computational cost and time in wind power forecasting while retaining high accuracy.
• Non-linear correlations of features to wind power were identified via deep learning.
• Blade pitch angle is significant for power prediction at above-rated wind speeds.
• Wind speeds at various heights and wind shear are involved in power prediction.
• The deep learning model retains high accuracy at a lower computational cost.
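The linear half of that feature screening can be sketched with `numpy.corrcoef`. The synthetic SCADA-like data, the threshold, and the feature names below are assumptions; as the abstract notes, non-linear relevance still needs the deep learning model itself:

```python
import numpy as np

def screen_features(X, y, names, threshold=0.2):
    """Keep features whose absolute Pearson correlation with the target
    reaches `threshold`; drop the rest (linear screening only)."""
    keep, dropped = [], []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        (keep if abs(r) >= threshold else dropped).append(name)
    return keep, dropped

# Synthetic 1-s SCADA-like samples (names and physics simplified)
rng = np.random.default_rng(1)
n = 500
wind = rng.uniform(3, 15, n)                          # m/s
pitch = np.clip(wind - 11, 0, None) + rng.normal(0, 0.2, n)  # deg, above rated
temp = rng.normal(15, 5, n)                           # ambient temp, unrelated here
power = np.minimum(wind**3, 11**3) / 11**3 + rng.normal(0, 0.05, n)  # normalised

X = np.column_stack([wind, pitch, temp])
keep, dropped = screen_features(X, power, ["wind_speed", "pitch", "amb_temp"])
print(keep, dropped)  # ambient temperature is screened out
```

Mirroring the study's finding, the feature built to be uninformative (ambient temperature in this toy) is the one the screening removes.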
Stock price modeling and prediction have been challenging objectives for researchers and speculators because of the noisy and non-stationary characteristics of samples. With the growth of deep learning, the task of feature learning can be performed more effectively by a purposely designed network. In this paper, we propose a novel end-to-end model named multi-filters neural network (MFNN), designed specifically for feature extraction on financial time series samples and the price movement prediction task. Both convolutional and recurrent neurons are integrated to build the multi-filters structure, so that information from different feature spaces and market views can be obtained. We apply our MFNN to extreme market prediction and signal-based trading simulation tasks on the Chinese stock market index CSI 300. Experimental results show that our network outperforms traditional machine learning models, statistical models, and single-structure (convolutional, recurrent, and LSTM) networks in terms of accuracy, profitability, and stability.
Lately, with deep learning outpacing the other machine learning techniques in classifying images, we have witnessed a growing interest of the remote sensing community in employing these techniques for land use and land cover classification based on multispectral and hyperspectral images; the number of related publications almost doubling each year since 2015 attests to this. The advances in remote sensing technologies, and hence the fast-growing volume of timely data available at the global scale, offer new opportunities for a variety of applications. Deep learning, being significantly successful in dealing with big data, seems to be a great candidate for exploiting the potential of such complex massive data. However, there are some challenges related to the ground truth, resolution, and nature of the data that strongly impact the performance of classification. In this paper, we review the use of deep learning in land use and land cover classification based on multispectral and hyperspectral images, and we introduce the available data sources and datasets used by studies in the literature; we provide the readers with a framework to interpret the state of the art of deep learning in this context and offer a platform to approach the methodologies, data, and challenges of the field.
Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of Non-Hodgkin’s Lymphoma, presenting a great challenge for treatment due to its highly heterogeneous nature. DLBCL is diagnosed based on microscopy images of patient tissue samples. To help gain a better understanding of DLBCL, we developed an automated computer vision method to analyze morphological and color-based information within patient biopsies. We analyzed a dataset of whole slide images of DLBCL by segmenting individual cells and representing cell morphologies through a set of engineered features. The features were evaluated using a variety of visualization and machine learning (ML) classification techniques. Current state-of-the-art deep learning methods use images as the input in classification tasks, achieving high performance but lacking interpretability; a big challenge lies in finding out which features the pixel-based deep learning methods utilize in prediction. Here, we present a technique that not only yields high prediction accuracy but also provides insights into which of the features are key for prediction. We show that the color-based features have the highest importance for cell classification, allowing for the accurate identification of various cell types with an accuracy of 84% in a multi-class and 91% in a binary classification setting. Our results provide valuable insights for exploring cell image datasets to gain an in-depth view of the tumor microenvironment.
The proliferation of false information is a growing problem in today's dynamic online environment. This phenomenon requires automated detection of fake news to reduce its harmful effect on society. Even though various methods are used to detect fake news, most methods only consider data-oriented text features, ignoring dual emotion features (publisher emotions and social emotions), and thus fall short in accuracy. This study addresses this issue by utilizing dual emotion features to detect fake news. The study proposes a Deep Normalized Attention-based mechanism for enriched extraction of dual emotion features and an Adaptive Genetic Weight Update-Random Forest (AGWu-RF) for classification. First, the deep normalized attention-based mechanism incorporates BiGRU, which improves feature value by extracting long-range context information and eliminating gradient explosion issues. Second, the adaptive genetic weight update adjusts the RF weights to reach optimized hyperparameter values that support the classifier's detection accuracy. The proposed model outperforms baseline methods on standard benchmark metrics in three real-world datasets. It outperforms state-of-the-art approaches by 5%, 11%, and 14% in terms of accuracy, highlighting the significance of dual emotion capabilities and optimizations in improving fake news detection.
Paying attention to feature engineering is the basis for constructing a more accurate building energy consumption prediction model, which helps debug, control, and operate building energy management systems. Therefore, in this paper, an integrated energy consumption prediction model considering spatial characteristics in time series data is proposed to predict the short-term energy consumption of educational buildings. The influence of features on the model is analyzed using the cooperative game theory-based SHAP method, and the optimal number of features is determined by ablation analysis. The proposed model is validated on an educational building in Xi'an, Shaanxi Province. The results show that, compared with other energy consumption prediction models, the RMSE of the integrated model is reduced by 13.64%–34.55% and the MAE by 10.25%–30.54%, indicating higher prediction accuracy. In addition, this paper also investigates the minimum amount of data and the number of features required to train the building energy prediction model; the integrated model can still effectively predict building energy consumption when training samples are minimal and the number of features is appropriate.
• Feature engineering method using deep learning to mine features.
• Analyzed the impact of features on predictive models using cooperative game theory.
• Analyzed the impact of dataset size and number of features on model predictions.
• Developed an integrated model for building energy consumption prediction.
• The main factors affecting energy consumption in educational buildings are discussed.
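SHAP itself requires the `shap` library and a fitted model; as a lighter stand-in that answers the same question (which inputs drive the energy prediction), here is a model-agnostic permutation-importance sketch. The synthetic building data, feature names, and the pre-fitted model stub are all illustrative assumptions:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Shuffle one feature column at a time and measure how much the RMSE
    worsens; a large increase means the model relies on that feature."""
    rng = np.random.default_rng(seed)
    base = np.sqrt(np.mean((predict(X) - y) ** 2))
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])            # destroy feature j's information
            rmse = np.sqrt(np.mean((predict(Xp) - y) ** 2))
            scores[j] += (rmse - base) / n_repeats
    return scores

# Synthetic data: consumption driven by outdoor temperature and occupancy
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))                          # temp, occupancy, humidity
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, 400)
predict = lambda M: 3.0 * M[:, 0] + 1.0 * M[:, 1]      # stand-in for a fitted model

scores = permutation_importance(predict, X, y)
print(scores.argmax())  # feature 0 (temperature) dominates
```

Unlike SHAP, this gives a global rather than a per-sample attribution, but the ablation-style logic (remove a feature's information, watch the error) is the same idea the paper uses to pick the optimal number of features.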
• Automated detection of epilepsy using EEG signals from 121 participants.
• Hypercube-based feature extractor and multilevel discrete wavelet transform techniques are employed.
• Neighborhood component analysis (NCA) is used as a feature selector.
• Attained 87.78% classification accuracy using voting and 79.07% with LOSO CV.
Epilepsy is one of the most common neurologic disorders worldwide and generally causes seizures. Electroencephalography (EEG) is widely used in seizure diagnosis. To detect epilepsy automatically, various machine learning (ML) models have been introduced in the literature, but the EEG signal datasets used for epilepsy detection are relatively small. Our main objective is to present a large EEG signal dataset and investigate the detection ability of a new hypercube pattern-based framework using the EEG signals.
This study collected a large EEG signal dataset (10,356 EEG signals) from 121 participants. We proposed a new information fusion-based feature engineering framework to achieve high classification performance on this dataset. The dataset consists of 35 channels, and our proposed feature engineering model extracts features from each channel. A new hypercube-based feature extractor is proposed to generate two feature vectors in the feature extraction phase. Various statistical parameters of the signals are used to create a feature vector. Multilevel discrete wavelet transform (MDWT) is applied to develop a multileveled feature extraction function, yielding seven feature vectors per channel. In this work, we extracted 245 (= 35 × 7) feature vectors, and the most valuable features from these vectors were selected using the neighborhood component analysis (NCA) selector. Finally, the selected features were fed to the k-nearest neighbors (kNN) classifier with a leave-one-subject-out (LOSO) cross-validation (CV) strategy. The channel-wise results were then voted/fused to obtain the highest classification performance.
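The LOSO protocol described above can be sketched with scikit-learn's `LeaveOneGroupOut` and a kNN classifier. The toy features standing in for the NCA-selected EEG features, and the subject layout, are synthetic assumptions:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

def loso_accuracy(X, y, subjects, k=1):
    """Leave-one-subject-out CV: each fold holds out every signal of one
    participant, so the classifier is never tested on a subject it has
    seen during training — the stricter of the two reported protocols."""
    logo = LeaveOneGroupOut()
    correct = total = 0
    for train_idx, test_idx in logo.split(X, y, groups=subjects):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X[train_idx], y[train_idx])
        correct += int((clf.predict(X[test_idx]) == y[test_idx]).sum())
        total += len(test_idx)
    return correct / total

# Toy stand-in for selected EEG features: 6 subjects, 20 signals each
rng = np.random.default_rng(7)
subjects = np.repeat(np.arange(6), 20)
y = rng.integers(0, 2, subjects.size)                       # two classes
X = y[:, None] * 3.0 + rng.normal(size=(subjects.size, 8))  # class-shifted features

acc = loso_accuracy(X, y, subjects)
print(round(acc, 2))
```

Grouping the split by subject rather than by signal is what makes LOSO harder than plain k-fold CV, which is consistent with the 79.07% LOSO figure trailing the 87.78% voted result.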
In this work, we attained 87.78% classification accuracy by voting these results and 79.07% with LOSO CV on the EEG signals.
The proposed fusion-based feature engineering model achieved satisfactory classification performance on the largest EEG signal dataset for epilepsy detection.