The identification of underground formation lithology is fundamental to reservoir characterization in petroleum exploration. With the increasing availability and diversity of well-logging data, automated interpretation of well-logging data is in great demand to support more efficient and reliable decision making by geologists and geophysicists. This study benchmarked the performance of an array of machine learning models, from linear and nonlinear individual classifiers to ensemble methods, on the task of lithology identification. Cross-validation and Bayesian optimization were utilized to optimize the hyperparameters of the different models, and performance was evaluated using the metrics of accuracy, area under the receiver operating characteristic curve (AUC), precision, recall, and F1-score. The dataset consists of well-logging data acquired from the Baikouquan formation in the Mahu Sag of the Junggar Basin, China, comprising 4156 labeled data points with 9 well-logging variables. Results show that the ensemble methods (XGBoost and RF) outperform the other two categories of machine learning methods by a substantial margin. Among the ensemble methods, XGBoost performs best, achieving an overall accuracy of 0.882 and an AUC of 0.947 in classifying mudstone, sandstone, and sandy conglomerate. Among the three lithology classes, sandy conglomerate, which constitutes the potential reservoirs in the study area, is best distinguished, with an accuracy of 97%, a precision of 0.888, and a recall of 0.969, suggesting that the XGBoost model is a strong candidate for more efficient and accurate lithology identification and reservoir quantification for geologists.
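A minimal sketch of the kind of pipeline this abstract describes: an XGBoost multiclass classifier tuned by Bayesian optimization under cross-validation. The paper does not name a library, so scikit-optimize's BayesSearchCV is assumed here; the synthetic data, class count, and search ranges are illustrative placeholders, not the study's actual configuration.

```python
# Sketch: XGBoost + Bayesian hyperparameter search, assuming skopt/xgboost
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from skopt import BayesSearchCV          # scikit-optimize
from xgboost import XGBClassifier

# Stand-in for 4156 labeled samples, 9 well-logging variables, 3 lithologies
X, y = make_classification(n_samples=4156, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

search = BayesSearchCV(
    XGBClassifier(objective="multi:softprob", eval_metric="mlogloss"),
    {"max_depth": (2, 10),
     "learning_rate": (1e-3, 0.3, "log-uniform"),
     "n_estimators": (100, 800)},
    n_iter=30, cv=5, scoring="accuracy", random_state=0)
search.fit(X_tr, y_tr)

proba = search.predict_proba(X_te)
print("accuracy:", accuracy_score(y_te, search.predict(X_te)))
print("macro AUC:", roc_auc_score(y_te, proba, multi_class="ovr"))
```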
Among natural hazards occurring offshore, submarine landslides pose a significant risk to offshore infrastructure installations attached to the seafloor. With the offshore being important for current and future energy production, there is a need to anticipate where future landslide events are likely to occur in order to support planning and development projects. Using the northern Gulf of Mexico (GoM) as a case study, this paper performs Landslide Susceptibility Mapping (LSM) using a gradient-boosted decision tree (GBDT) model to characterize the spatial patterns of submarine landslide probability over the United States Exclusive Economic Zone (EEZ) where water depths are greater than 120 m. With known spatial extents of historic submarine landslides and a Geographic Information System (GIS) database of known topographical, geomorphological, geological, and geochemical factors, the resulting model was capable of accurately forecasting potential locations of sediment instability. Results of a permutation modelling approach indicated that LSM accuracy is sensitive to the number of unique training locations, with model accuracy becoming more stable as the number of training regions increased. The influence that each input feature had on predicted landslide susceptibility was evaluated using the SHapley Additive exPlanations (SHAP) feature attribution method. Areas of high and very high susceptibility were associated with steep terrain, including salt basins and escarpments. This case study serves as an initial assessment of machine learning (ML) capabilities for producing accurate submarine landslide susceptibility maps given the current state of available natural hazard-related datasets, and conveys both successes and limitations.
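A hedged sketch of the core workflow: a gradient-boosted classifier predicts per-cell landslide probability, and SHAP attributes each prediction to the input terrain features. The feature names and synthetic grid cells below are invented stand-ins for the paper's GIS database, and sklearn's GradientBoostingClassifier stands in for whichever GBDT implementation the authors used.

```python
# Sketch: GBDT susceptibility model + SHAP attribution (illustrative data)
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
features = ["slope", "salt_basin_dist", "escarpment_dist", "sediment_thickness"]
X = rng.normal(size=(2000, len(features)))                  # one row per grid cell
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0.8).astype(int)  # toy landslide labels

model = GradientBoostingClassifier(random_state=0).fit(X, y)
susceptibility = model.predict_proba(X)[:, 1]   # per-cell landslide probability

# SHAP values: per-sample, per-feature contribution to the model output
shap_values = shap.TreeExplainer(model).shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, v in sorted(zip(features, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {v:.3f}")
```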
Seawalls are critical defence infrastructures in coastal zones that protect hinterland areas from storm surges, wave overtopping and soil erosion hazards. Scouring at the toe of sea defences, caused by wave-induced accretion and erosion of bed material, poses a significant threat to the structural integrity of coastal infrastructure. Accurate prediction of scour depths is essential for appropriate and efficient design and maintenance of coastal structures, which serve to mitigate risks of structural failure through toe scouring. However, limited guidance and predictive tools are available for estimating toe scour at sloping structures. In recent years, Artificial Intelligence and Machine Learning (ML) algorithms have gained interest, and although they underpin robust predictive models for many coastal engineering applications, such models have yet to be applied to scour prediction. Here we develop and present ML-based models for predicting toe scour depths at sloping seawalls. Four ML algorithms, namely Random Forest (RF), Gradient Boosted Decision Trees (GBDT), Artificial Neural Networks (ANNs), and Support Vector Machine Regression (SVMR), are utilised. Comprehensive physical modelling measurement data are utilised to develop and validate the predictive models. A novel framework for feature selection, feature importance, and hyperparameter tuning is adopted for the pre- and post-processing steps of the ML-based models. In-depth statistical analyses are conducted to evaluate the predictive performance of the proposed models. The results indicate a minimum of 80% prediction accuracy across all the algorithms tested in this study; overall, SVMR produced the most accurate predictions, with a coefficient of determination (r²) of 0.74 and a mean absolute error (MAE) of 0.17. The SVMR algorithm was also the most computationally efficient of the algorithms tested. The methodological framework proposed in this study can be applied to scouring datasets for rapid assessment of scour at coastal defence structures, facilitating model-informed decision-making.
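A minimal sketch of the four-way model comparison the abstract reports, fitting RF, GBDT, ANN, and SVMR to a common dataset and scoring with r² and MAE. The synthetic regression data stands in for the physical-modelling measurements, which are not public, and scikit-learn estimators stand in for the authors' implementations.

```python
# Sketch: comparing RF / GBDT / ANN / SVMR on a shared scour-style dataset
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "RF": RandomForestRegressor(random_state=0),
    "GBDT": GradientBoostingRegressor(random_state=0),
    "ANN": make_pipeline(StandardScaler(),
                         MLPRegressor(max_iter=2000, random_state=0)),
    "SVMR": make_pipeline(StandardScaler(), SVR()),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: r2={r2_score(y_te, pred):.3f} "
          f"MAE={mean_absolute_error(y_te, pred):.2f}")
```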
Machine learning models have become widespread in materials science research. An open-access, community-driven database containing over 40,000 perovskite photovoltaic devices has recently been published. This resource enables the application of predictive data-driven models that correlate device structure with photovoltaic performance, whereas the literature usually focuses on specific device layers. Herein, the concept of device-level performance prediction is explored using gradient-boosted regression trees as the core algorithm and Shapley values analysis to interpret and rationalize the results. The main pitfalls and conceptual limitations of the approach are discussed and correlated with the database structure and dimension by comparing the performance of different choices of descriptors and dataset sizes. Evidence suggests that the additional features introduced herein, in particular chemical descriptors of perovskite additives, can boost regression performance at the device level. A specific model is finally trained to predict the performance of unseen devices and tested on experimental data from the literature. This task is found to be particularly challenging, as the ability of the model to generalize to a new chemical space is limited by several factors, including the amount and quality of the available data.
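An illustrative sketch, not the paper's model: gradient-boosted regression trees map device-level descriptors to an efficiency-style target, and Shapley values rank descriptor influence. The descriptor names, data, and coefficients below are invented placeholders chosen only to make the example run.

```python
# Sketch: gradient-boosted regression trees + Shapley-value ranking
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
descriptors = ["bandgap", "additive_logP", "ETL_thickness", "anneal_temp"]
X = rng.normal(size=(3000, len(descriptors)))
# Toy device efficiency driven mostly by the first two descriptors
pce = 18 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=3000)

model = GradientBoostingRegressor(random_state=0).fit(X, pce)
sv = shap.TreeExplainer(model).shap_values(X)
ranking = sorted(zip(descriptors, np.abs(sv).mean(axis=0)), key=lambda t: -t[1])
print(ranking)  # which descriptors drive predicted efficiency
```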
Soil spectroscopy has seen tremendous growth in soil property characterisation and can be used not only in the laboratory but also from space (imaging spectroscopy). Partial least squares (PLS) regression is one of the most common approaches for calibrating soil properties from soil spectra. Besides functioning as a calibration method, PLS can also be used as a dimension reduction tool, which has scarcely been studied in soil spectroscopy. PLS components retained from high-dimensional spectral data can be further explored with the gradient-boosted decision tree (GBDT) method. Three soil sample categories were extracted from the Land Use/Land Cover Area Frame Survey (LUCAS) soil library according to the type of land cover (woodland, grassland, and cropland). First, PLS regression and GBDT were separately applied to build spectroscopic models for soil organic carbon (OC), total nitrogen content (N), and clay for each soil category. Then, PLS-derived components were used as input variables for the GBDT model. The results demonstrate that the combined PLS-GBDT approach performs better than PLS or GBDT alone. The relative importance of variables for soil property estimation revealed by the proposed method demonstrates that PLS is a useful dimension reduction tool for soil spectra that retains target-related information.
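A minimal sketch of the PLS-GBDT idea under stated assumptions: PLS compresses high-dimensional spectra into a handful of target-related components, and those components then feed a GBDT regressor. The synthetic spectra stand in for the LUCAS library, and the component count is an arbitrary example rather than the paper's tuned value.

```python
# Sketch: PLS as dimension reduction feeding a GBDT regressor
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 400))                  # 400 spectral bands (toy)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=1000)  # e.g. soil OC
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: supervised dimension reduction with PLS (scores = components)
pls = PLSRegression(n_components=10).fit(X_tr, y_tr)
# Step 2: GBDT trained on the retained PLS components
gbdt = GradientBoostingRegressor(random_state=0).fit(pls.transform(X_tr), y_tr)
print("PLS-GBDT r2:", r2_score(y_te, gbdt.predict(pls.transform(X_te))))
```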
Street vitality has become an important indicator for evaluating the attractiveness and potential for sustainable development of urban neighborhoods. However, research on this topic may overestimate or underestimate the effects of different influencing factors, as most studies overlook the prevalent nonlinear and synergistic effects. This study takes the central urban districts of humid–hot cities in developing countries as an example, utilizing readily available big data sources, such as Baidu Heat Map data, Baidu Map data, Baidu Building data, urban road network data, and Amap’s Point of Interest (POI) data, to construct a Gradient-Boosting Decision Tree (GBDT) model. This model reveals the nonlinear and synergistic effects of different built environment factors on street vitality. The study finds that (1) construction intensity plays a crucial role in the early stages of urban street development (with a contribution value of 0.71), and as the city matures, the role of diversity gradually becomes apparent (with the contribution value increasing from 0.03 to 0.08); (2) built environment factors have nonlinear impacts on street vitality; for example, POI density has different thresholds in the three cities (300, 200, and 500); and (3) there are significant synergistic effects between different dimensions and indicators of the built environment: for example, when POI density is high and integration exceeds 1.5, a positive synergistic effect is notable, whereas a negative synergistic effect occurs when POI density is low. This article further discusses the practical implications of the research findings, providing nuanced and targeted policy suggestions for humid–hot cities at different stages of development.
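A sketch of how such nonlinear and synergistic effects can be read off a fitted GBDT with partial dependence: a one-dimensional curve exposes thresholds (e.g. in POI density), while a two-dimensional plot exposes pairwise synergy (e.g. POI density with integration). The data, feature names, and threshold values are invented for illustration; the study's big-data inputs are not reproduced. Requires matplotlib.

```python
# Sketch: reading nonlinear thresholds and synergy from a GBDT via
# partial dependence (illustrative data only)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
poi = rng.uniform(0, 600, 2000)
integration = rng.uniform(0, 3, 2000)
intensity = rng.uniform(0, 1, 2000)
# Toy street-vitality signal with a POI threshold and a POI x integration synergy
vitality = ((poi > 300) * 1.0
             + 0.5 * (poi > 300) * (integration > 1.5)
             + 0.7 * intensity)
X = np.column_stack([poi, integration, intensity])
model = GradientBoostingRegressor(random_state=0).fit(X, vitality)

# Feature 0 alone (nonlinear threshold) and features (0, 1) jointly (synergy)
PartialDependenceDisplay.from_estimator(
    model, X, features=[0, (0, 1)],
    feature_names=["poi_density", "integration", "intensity"])
plt.show()
```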
Traditionally, mathematical optimization methods have been applied in manufacturing industries, where production scheduling is one of the most important problems and is being actively researched. Extant studies assume that processing times are known or follow a simple distribution. However, the actual processing time in a factory is often unknown and likely follows a complex distribution. Therefore, in this study, we consider estimating the processing time using a machine-learning model. Although there are studies that use machine learning for scheduling optimization itself, the purpose of this study is to estimate an unknown processing time. Using machine-learning models, one can estimate processing times that follow an unknown and complex distribution and further improve the schedule using the computed variable importance. Based on the above, we propose a system for estimating the processing time using machine-learning models when the processing time follows a complex distribution in actual factory data. The advantages of the proposed system are its versatility and applicability to real-world factories, where processing times are often unknown. The proposed method was evaluated using process information, with the processing time for each manufacturing sample provided by research partner companies. The light gradient-boosted machine (LightGBM) algorithm and ridge regression performed best in terms of MAPE and RMSE. Optimizing parallel machine scheduling using processing times estimated by our method reduced the makespan by approximately 30% on average. In contrast, probabilistic sampling methods, namely kernel density estimation and sampling from fitted gamma and normal distributions, performed worse than the ML approaches. In addition, machine-learning models can be used to deduce the variables that affect the estimation of processing times, and in this study, we demonstrate an example of feature importance computed from experimental data.
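A hedged sketch of the estimation step: a LightGBM regressor fit on process features predicts per-sample processing time, is scored with MAPE and RMSE, and exposes the feature importances that the schedule refinement would inspect. The skewed synthetic processing times stand in for the partner-company data, which is not public.

```python
# Sketch: LightGBM processing-time estimation + feature importance
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))                    # toy process/order features
# Skewed, complex-looking processing times (gamma noise on a nonlinear base)
t = np.exp(0.4 * X[:, 0] + 0.2 * X[:, 1]) + rng.gamma(2.0, 0.5, 5000)
X_tr, X_te, t_tr, t_te = train_test_split(X, t, random_state=0)

model = LGBMRegressor(random_state=0).fit(X_tr, t_tr)
pred = model.predict(X_te)
print("MAPE:", mean_absolute_percentage_error(t_te, pred))
print("RMSE:", mean_squared_error(t_te, pred) ** 0.5)
print("feature importances:", model.feature_importances_)
```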
With the development of urban science, research on mining urban big data has attracted more and more attention. One typical microcosm of urban big data is taxi trajectory data. Accurately predicting the travel time between two specified points is of great significance for applications such as travel planning. However, current approaches use only limited modalities of data or a single model, without considering their one-sidedness. This paper puts forward an optimized method for estimating travel time based on an ensemble method with multi-modality urban big data, namely Travel Time Estimation-Ensemble (TTE-Ensemble). First, we extract feature sub-vectors from the multi-modality data as the model input. Then we use a gradient boosting decision tree (GBDT) model to process the low-dimensional simple features and a deep neural network (DNN) model to handle the high-dimensional underlying features. Finally, an ensemble method is introduced to integrate the GBDT and DNN models. Extensive experiments were conducted on real datasets of origin-destination points in Chengdu and Shanghai, China. These experiments demonstrate the superiority of the TTE-Ensemble model.
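A minimal sketch of the TTE-Ensemble structure as described: a GBDT handles the low-dimensional simple features, a neural network handles the high-dimensional ones, and their predictions are blended. The split of features, the equal blending weights, and the synthetic data are illustrative assumptions, not the paper's configuration.

```python
# Sketch: GBDT on simple features + DNN on deep features, blended
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X_simple = rng.normal(size=(4000, 5))     # e.g. distance, hour, weekday
X_deep = rng.normal(size=(4000, 50))      # e.g. embedded trajectory features
y = X_simple[:, 0] * 10 + X_deep[:, :3].sum(axis=1) + rng.normal(size=4000)
idx_tr, idx_te = train_test_split(np.arange(4000), random_state=0)

gbdt = GradientBoostingRegressor(random_state=0).fit(X_simple[idx_tr], y[idx_tr])
dnn = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
).fit(X_deep[idx_tr], y[idx_tr])

# Equal-weight blend of the two models' travel-time predictions
blend = 0.5 * gbdt.predict(X_simple[idx_te]) + 0.5 * dnn.predict(X_deep[idx_te])
print("ensemble MAE:", mean_absolute_error(y[idx_te], blend))
```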
This paper provides detailed information about team Leustagos' approach to the wind power forecasting track of GEFCom 2012. The task was to predict the hourly power generation at seven wind farms 48 hours ahead. The problem was addressed by extracting time- and weather-related features, which were used to build gradient-boosted decision tree and linear regression models. This approach achieved first place on both the public and private leaderboards.
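A sketch of the described approach under stated assumptions: hand-built time and weather features feed a gradient-boosted tree model alongside a linear model, and the two forecasts are combined. The team's actual features, models, and blending weights are not reproduced; everything below is a toy stand-in.

```python
# Sketch: time/weather features -> GBDT + linear regression blend
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = np.arange(5000)
wind_speed = 6 + 2 * np.sin(hours / 24 * 2 * np.pi) + rng.normal(scale=0.5, size=5000)
power = np.clip(0.1 * wind_speed ** 2, 0, 8) + rng.normal(scale=0.3, size=5000)

# Time- and weather-related features: hour of day plus forecast wind speed
X = np.column_stack([hours % 24, wind_speed])
X_tr, X_te, y_tr, y_te = X[:4000], X[4000:], power[:4000], power[4000:]

gbdt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)
forecast = 0.7 * gbdt.predict(X_te) + 0.3 * lin.predict(X_te)  # assumed blend
print("RMSE:", np.sqrt(np.mean((forecast - y_te) ** 2)))
```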
Alzheimer’s disease (AD) is one of the most common forms of dementia, accounting for more than 70% of cases. The factors behind the cause and progression of neurodegenerative diseases like AD are primarily genetic, in addition to lifestyle and environmental factors. Early and accurate diagnosis of AD empowers practitioners to take timely clinical decisions and preventive actions. With this motivation, this work proposes a novel pattern matching and scoring method on genetic material towards devising an effective classifier. We propose a distinctive disease-causing gene sequence pattern identification method using suffix trees as the base detection model, with an accuracy of 91.5% in linear time complexity. A scoring mechanism is implemented to assign scores to genes based on the severity of the disease-causing and disease-resistant single nucleotide polymorphisms associated with the genes. These scores are then used as a salient feature in a gradient boosted decision tree classifier to enhance the classification of AD versus healthy controls. The efficiency of the proposed gene-powered EGBDT classifier is evaluated on the ADNI benchmark dataset, with a prediction accuracy of 94.16%, and is found to be efficient compared to recent works in the literature.
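A highly simplified sketch of the scoring-as-feature idea: known risk and protective SNP patterns are matched against each subject's sequence (plain substring search stands in for the paper's linear-time suffix-tree model), signed match counts become a gene score, and that score feeds a GBDT classifier. All patterns, sequences, and labels below are invented.

```python
# Sketch: pattern-match scoring feeding a GBDT classifier (toy data)
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

RISK = ["ACGT", "TTAG"]          # stand-ins for disease-causing SNP patterns
PROTECTIVE = ["GGCA"]            # stand-in for a disease-resistant pattern

def gene_score(seq: str) -> int:
    """Severity score: +1 per risk-pattern hit, -1 per protective hit."""
    return sum(seq.count(p) for p in RISK) - sum(seq.count(p) for p in PROTECTIVE)

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(500)]
scores = np.array([gene_score(s) for s in seqs]).reshape(-1, 1)
# Toy AD-vs-control labels loosely tied to the score
labels = (scores.ravel() + rng.normal(scale=1.0, size=500) > 2).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(scores, labels)
print("train accuracy:", clf.score(scores, labels))
```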