The factors influencing residents' health have become complex and intertwined as the economy and society have developed. Traditional research that examines a single factor cannot provide an accurate picture of the situation. This paper collects data on economic, environmental and social factors to estimate their impact on regional health. Because the data are multi-source and complex, this paper proposes a combined feature importance algorithm that weights the feature importance scores of RF, XGB and SOIL. The algorithm does not depend on the data and adaptively approximates the true results. The results show that economic factors have a significant and direct impact on health, environmental factors have a lagged correlation with health level, and social factors have a more complicated effect on health. Finally, we provide health policy suggestions covering economic, environmental, and social development.
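The weighting step described above can be illustrated with a minimal sketch. The abstract does not say how the weights are chosen or how the per-model scores are scaled, so this example assumes each model's scores are first normalized to sum to 1 and that the weights are supplied by the user. The feature names, scores, and weights below are made up for illustration and are not taken from the paper.

```python
def normalize(scores):
    """Scale non-negative importance scores so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: value / total for name, value in scores.items()}


def combine_importance(per_model, weights):
    """Weighted average of normalized per-model importance scores."""
    weight_sum = sum(weights[model] for model in per_model)
    combined = {}
    for model, scores in per_model.items():
        share = weights[model] / weight_sum
        for feature, value in normalize(scores).items():
            combined[feature] = combined.get(feature, 0.0) + share * value
    return combined


# Hypothetical importance scores for three hypothetical features.
per_model = {
    "RF": {"gdp": 0.5, "pm25": 0.3, "doctors": 0.2},
    "XGB": {"gdp": 6.0, "pm25": 3.0, "doctors": 1.0},
    "SOIL": {"gdp": 0.9, "pm25": 0.6, "doctors": 0.5},
}
# Hypothetical weights; the paper's adaptive weighting is not reproduced here.
weights = {"RF": 1.0, "XGB": 1.0, "SOIL": 2.0}

combined = combine_importance(per_model, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Normalizing before averaging keeps a model that reports scores on a larger scale (here the XGB row) from dominating the combined ranking.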
The renewal of green home appliances is a crucial measure for households to save energy and reduce emissions. However, how online reviews, especially those related to energy saving, affect green home appliance purchase behavior (GHAPB) remains underexplored. In this paper, we investigate over 1 million online reviews of about 3,116 types of air conditioner from JD. By applying word2vec, we divide energy-saving-related information into three types (norm information, environmental health information and price information) and construct a dictionary for each. Then, the effect of energy-saving information is quantified in terms of breadth, depth and intensity through sentiment analysis. Finally, the influence of energy-saving information in online reviews on GHAPB is analyzed using linear regression and machine learning models. The results show that all energy-saving information has a positive impact on GHAPB, and that environmental health information is the most important type. In addition, the attributes of online reviews have a greater influence on GHAPB than the attributes of products. This in-depth exploration of energy-saving information in online reviews provides targeted recommendations for manufacturers and retailers to promote the adoption of green home appliances.
The rapid increase in both the quantity and complexity of data generated daily in the field of environmental science and engineering (ESE) demands a corresponding advancement in data analytics. Advanced data analysis approaches, such as machine learning (ML), have become indispensable tools for revealing hidden patterns or deducing correlations where conventional analytical methods face limitations or challenges. However, ML concepts and practices have not been widely adopted by researchers in ESE. This feature explores the potential of ML to revolutionize data analysis and modeling in the ESE field, and covers the essential knowledge needed for such applications. First, we use five examples to illustrate how ML addresses complex ESE problems. We then summarize four major types of applications of ML in ESE: making predictions; extracting feature importance; detecting anomalies; and discovering new materials or chemicals. Next, we introduce the essential knowledge required and current shortcomings in ML applications in ESE, with a focus on three important but often overlooked components when applying ML: correct model development, proper model interpretation, and sound applicability analysis. Finally, we discuss challenges and future opportunities in the application of ML tools in ESE to highlight the potential of ML in this field.
In this paper, we test new approaches for predicting the amount of element oxides in rock samples from the ChemCam instrument suite onboard the NASA Curiosity rover, focusing on K2O. Using the expanded dataset compiled by Gasda et al. (2021), with and without the Earth to Mars transformation (E2M and NoE2M) discussed in Clegg et al. (2017), we trained blended submodels using the “double blending” technique and compared these to ensemble methods (Random Forest, Extra Trees, and Gradient Boosting Regression). We found that ensemble methods performed similarly to blended submodels in terms of RMSE-P on the laboratory spectra and provided significant advantages on spectra coming from Mars. For the full model, blended submodels achieved an RMSE-P of 0.62 and 0.60 (E2M and NoE2M, respectively), while Gradient Boosting Regression resulted in a slightly improved RMSE-P of 0.59 and 0.60. More importantly, by employing a local RMSE-P estimation technique, in which model performance is evaluated based on nearby test samples, we found that ensemble methods can lower the quantification limit for K2O from the current value of ≈0.6 wt% to ≈0.08 wt% using Extra Trees and Random Forest. This would allow a much larger range of K2O values to be quantified on Mars with greater certainty, given that most targets seen on Mars tend to have <1 wt% K2O. Finally, we used both Mean Decrease in Impurity (MDI) and permutation importance techniques to investigate the wavelengths used by the ensemble methods and found that they correspond to known potassium emission lines. This suggests that ensemble methods can provide an easier-to-train and improved alternative to blended submodels for predicting potassium compositions from Laser Induced Breakdown Spectroscopy (LIBS) data.
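Permutation importance, one of the two techniques mentioned above, can be sketched in a few lines: shuffle one input column at a time and record how much the prediction error grows. The example below is a generic illustration, not the authors' pipeline. The `toy_model` function stands in for a trained regressor, the two "channels" stand in for spectral wavelengths, and the repeat count and seed are arbitrary assumptions.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(rmse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained model: uses channel 0, ignores channel 1."""
    return 2.0 * row[0]


rows = [[float(i), float((7 * i) % 10)] for i in range(10)]
y = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, y)
print(importances)
```

A channel the model never reads gets an importance of zero, while a channel the prediction depends on shows a clear error increase. That is the pattern one would look for when checking whether important wavelengths coincide with known emission lines.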
•Ensemble methods (EMs) are an easy-to-train and explainable approach to predict %K2O
•K emission lines appear in MDI and permutation importance techniques on trained EMs
•EMs lower the ChemCam quantification limit for K2O from ∼0.6 wt% to ∼0.08 wt%
•Ensemble methods perform better without the Earth to Mars transformation
•Exploration of Sentinel-2 time series for tree species mapping.
•Land surface phenology and composite imagery outperform regular multitemporal imagery for mapping tree species.
•Our approach provides high mapping accuracy in areas of frequent cloud cover.
•Feature importance reveals the importance of Sentinel-2 SWIR bands.
Optical satellite imagery with high temporal and spatial resolution, such as that acquired by Sentinel-2, is increasingly becoming available and is used to derive maps of tree species. Such mapping products are required for operational and sustainable forest management. Existing studies that employ Sentinel-2 imagery have already evaluated different classification algorithms but are often confined to areas smaller than a single Sentinel-2 scene. In this study, the area of interest (a large part of the Province of Tyrol, Austria) is covered by two Sentinel-2 tiles, of which approximately 5000 km² are forested. In order to derive seasonal metrics under recurrent cloud cover, we exploit one year of Sentinel-2 imagery by using land surface phenology (LSP) and seasonal cloud-free composites for mapping five different tree species groups (Broadleaved-, Larch- (Larix), Pine- (Pinus), Dwarf Pine- (Pinus mugo) and Spruce/Fir (Abies alba/Picea abies) stands). Although a regular multitemporal classification setup based on three cloud-free images reached an overall accuracy of around 84.4 % and outperformed monotemporal setups by around 10 percentage points, the availability of single cloud-free images was limited in the mountainous region. Thus, alternative approaches using combined measures for the entire time series of Sentinel-2 imagery, i.e. three-monthly temporal reflectance composites and phenological metrics, were tested and improved overall accuracy by a further 1–2 percentage points. In conclusion, we agree with previous studies that multitemporal imagery can help improve mapping accuracy. However, leveraging satellite image time series for large-scale mapping of tree species should not rely only on high-quality cloud-free single images and should be strongly supported by, e.g., seasonal composites or multi-image metrics. Therefore, the development and provisioning of such datasets should be fostered.
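The idea of a three-monthly reflectance composite can be sketched for a single pixel and band as follows. The abstract does not state the compositing rule or the window boundaries, so this example assumes a per-window median of cloud-free observations and calendar quarters (January to March, and so on). The observation values are made up.

```python
from statistics import median


def seasonal_composites(observations):
    """Median of cloud-free values per three-month window for one pixel and band.

    observations: iterable of (month, reflectance, is_cloudy) tuples, month in 1..12.
    Returns a dict mapping window index 0..3 to the median, or None if no clear view exists.
    """
    buckets = {window: [] for window in range(4)}
    for month, reflectance, is_cloudy in observations:
        if not is_cloudy:
            buckets[(month - 1) // 3].append(reflectance)
    return {w: (median(values) if values else None) for w, values in buckets.items()}


# Hypothetical one-year series for a single pixel.
observations = [
    (1, 0.10, False),
    (2, 0.50, True),   # cloudy, ignored
    (3, 0.12, False),
    (5, 0.20, False),
    (6, 0.30, False),
    (6, 0.22, False),
    (8, 0.90, True),   # cloudy, ignored
    (11, 0.15, False),
]
composites = seasonal_composites(observations)
print(composites)
```

Windows without any cloud-free observation are returned as `None` rather than filled, which leaves the gap-handling choice to the caller.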
A stacked ensemble model is developed for forecasting and analyzing the daily average concentrations of fine particulate matter (PM2.5) in Beijing, China. Special feature extraction procedures, including simplification, polynomial expansion, transformation and combination, are conducted before modeling to identify potentially significant features based on an exploratory data analysis. Stability feature selection and tree-based feature selection methods are applied to select important variables and evaluate the degrees of feature importance. Single models including LASSO, Adaboost, XGBoost and a multi-layer perceptron optimized by the genetic algorithm (GA-MLP) are established in the level 0 space and are then integrated by support vector regression (SVR) in the level 1 space via stacked generalization. A feature importance analysis reveals that nitrogen dioxide (NO2) and carbon monoxide (CO) concentrations measured in the city of Zhangjiakou are the most important pollution factors for forecasting PM2.5 concentrations. Local extreme wind speeds and maximal wind speeds are considered the meteorological factors with the greatest effect on the cross-regional transport of contaminants. Pollutants found in the cities of Zhangjiakou and Chengde have a stronger impact on air quality in Beijing than other surrounding factors. Our model evaluation shows that the ensemble model generally performs better than a single nonlinear forecasting model when applied to new data, with a coefficient of determination (R²) of 0.90 and a root mean squared error (RMSE) of 23.69 μg/m³. For single pollutant grade recognition, the proposed model performs better on days characterized by good air quality than on days registering high levels of pollution. The overall classification accuracy is 73.93%, with most misclassifications occurring between adjacent categories. The results demonstrate the interpretability and generalizability of the stacked ensemble model.
•Exploratory data analysis and feature extraction are conducted for a comprehensive understanding of air quality forecasting.
•Stability feature selection and tree-based feature selection methods are applied to select important variables.
•A stacked ensemble model is established to improve generalizability and robustness.
•Feature importance analysis is given to account for the interpretation of the ensemble model.
•The proposed model outperforms the other single models considered.
•Compressive and flexural strengths of SFRC are successfully predicted by machine learning algorithms.
•Tree-based and boosting models are recommended for SFRC predictions.
•W/C ratio and silica fume are the most important parameters for predicting compressive strength.
•Fiber volume fraction and silica fume are the most important parameters for predicting flexural strength.
•XGBoost and gradient boost regressors are selected as the most appropriate machine learning algorithms for SFRC.
Steel fiber-reinforced concrete (SFRC) has a performance superior to that of normal concrete because of the addition of discontinuous fibers. The development of strength prediction techniques for SFRC is, however, still in its infancy compared to that for normal concrete because of its complexity and the limited available data. To overcome this limitation, research was conducted to develop an optimum machine learning algorithm for predicting the compressive and flexural strengths of SFRC. The resulting feature impact was also analyzed to confirm the reliability of the models. To achieve this, compressive and flexural strength data for SFRC were collected through extensive literature reviews, and a database was created. Eleven machine learning algorithms were then established based on the dataset. K-fold validation was conducted to prevent overfitting, and the algorithms were regulated. The boosting- and tree-based models had the best performance, whereas the K-nearest neighbor, linear, ridge, lasso regressor, support vector regressor, and multilayer perceptron models had the worst performance. The water-to-cement ratio and silica fume content were the most influential factors in the prediction of the compressive strength of SFRC, whereas the silica fume content and fiber volume fraction most strongly influenced the flexural strength. Finally, it was found that, in general, the compressive strength prediction performance was better than the flexural strength prediction performance, regardless of the machine learning algorithm.
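As a reminder of what K-fold validation involves, the sketch below splits sample indices into k folds so that every sample is used for testing exactly once. The abstract does not report the value of k or whether the data were shuffled, so the fold count and the contiguous, unshuffled split used here are assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical dataset of 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

In practice a model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged. Shuffling the indices first is a common variation when the rows are ordered.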
•23.28% of Korean adolescents were overweight or obese.
•Nine machine learning-based models achieved accuracy of 0.7662 to 0.8403.
•The study analyzed feature importance via machine learning methods.
•Machine learning identifies a total of 22 factors behind Korean teen obesity.
•Our study underscores the vital need for tailored and collective prevention programs.
Overweight and obesity in adolescents have been reported as one of the most serious health threats worldwide, including in South Korea. This study aims to investigate the complex factors contributing to overweight and obesity in Korean adolescents using various machine learning methods. The research uses a dataset of 43,268 records from the 16th Korean Youth Risk Behavior Web-based Survey and explores 71 different factors, such as sociodemographic characteristics, dietary habits, health, behavior problems, family, and peer and school-related factors. Our analysis encompassed an array of algorithms, including Logistic Regression, Ridge, LASSO, Elasticnet, Decision tree, Bagging, Random forest, AdaBoost, and XGBoost. These nine machine learning models exhibited accuracy levels in the range of 0.7662 to 0.8403. Based on the domains and sub-domains of factors, the following domains were found to be important: sociodemographic characteristics, dietary habits, physical health, psychological health, behavioral problems, family factors, and peer and school factors. Additionally, we suggest that attention be given to newly emerged features indicated by the machine learning techniques, including oral health, smartphone addiction, smoking, sexual behavior, school violence, and nationality of parents. The current study's findings emphasize the critical need for collective and customized prevention programs that consider multifaceted features to prevent overweight and obesity among Korean adolescents.
•A categorical boosting model is employed to intelligently forecast building energy consumption.
•It raises accuracy and mitigates uncertainty in understanding building energy performance.
•Feature importance can be measured to quantify features’ impacts on energy consumption.
•Outlier detection can distinguish normal and abnormal energy usage to make early warnings.
•Results will provide references to make data-driven decisions in optimizing energy utilization.
For better energy evaluation and management, a categorical boosting (CatBoost)-based predictive method is presented to accurately estimate building energy consumption by learning from large volumes of multi-source heterogeneous data collected from buildings. Specifically, the recently developed CatBoost model, an ensemble learning method, excels at handling categorical variables and producing reliable results. As a case study, our proposed method is validated on a multi-dimensional dataset about Seattle's building energy performance provided by the city’s government, aiming to estimate the weather-normalized site energy use intensity of buildings and characterize its non-linear relationship with 12 other possibly influential features. Results from 5-fold cross-validation demonstrate that the model predicts the value of energy intensity precisely, with an R² of 0.897, outperforming popular machine learning algorithms including random forest and gradient boosting decision tree. Based on a defined threshold, the predicted values can be classified as normal or abnormal energy consumption, reaching an accuracy of 99.32% for outlier detection, which is helpful for flagging potential risks at an early stage and developing strategies to enhance energy efficiency. Moreover, results from the established model can be interpreted objectively, suggesting that features concerning physical and energy characteristics contribute more to energy estimation than environmental features. Because such results characterize building energy consumption and efficiency in a data-driven manner, they can eventually serve as guidance for building owners and designers in designing and renovating buildings to achieve better energy-conserving performance.
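The threshold-based outlier step can be pictured with a small sketch. The abstract does not define the threshold or how the 99.32% accuracy is computed, so this example makes two assumptions: a building is labeled abnormal when its energy use intensity exceeds a fixed cutoff, and accuracy is the share of buildings whose label from the predicted value matches the label from the observed value. All numbers and names are hypothetical.

```python
def label_abnormal(values, threshold):
    """Return True for each value above the threshold (assumed 'abnormal')."""
    return [value > threshold for value in values]


def accuracy(true_labels, predicted_labels):
    """Share of positions where the two label lists agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical observed and predicted energy use intensities for five buildings.
actual_eui = [40.0, 55.0, 120.0, 62.0, 210.0]
predicted_eui = [42.0, 50.0, 95.0, 60.0, 190.0]
threshold = 100.0  # hypothetical cutoff, not taken from the paper

acc = accuracy(
    label_abnormal(actual_eui, threshold),
    label_abnormal(predicted_eui, threshold),
)
print(acc)
```

In this toy data the third building is abnormal by its observed value but predicted just under the cutoff, which is the kind of borderline miss a threshold rule produces.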
This study utilizes machine learning (ML) algorithms to develop a robust total organic carbon (TOC) prediction model for river waters in the Geumho River sub-basins, South Korea, considering both non-rain and rain events. The model incorporates geospatial parameters such as land use, slope, and flow rate, together with basic water quality metrics including biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), and suspended solids (SS). A key aspect of this research is examining how land use information enhances the model's predictive accuracy. We compared two ML algorithms, extreme gradient boosting (XGBoost) and deep neural networks (DNN), with a traditional multiple linear regression (MLR) approach. XGBoost outperformed the others, achieving an R² value between 0.61 and 0.68 on the test dataset and demonstrating significant improvement during rain events, with an R² of 0.77 when land use data were included. In contrast, this enhancement was not observed with the MLR model. Feature importance analysis using Shapley values highlighted COD as the primary predictor for non-rain events, while during rain events, COD, TP, TN, SS and agricultural land collectively influenced TOC levels. This study significantly advances understanding of TOC variability across different land use scenarios in river systems and underscores the importance of integrating geospatial and water quality parameters to enhance TOC prediction, particularly during rain events. This methodology provides a valuable framework for developing river management strategies and monitoring long-term TOC trends, especially in scenarios with gaps in essential monitoring data.
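For readers unfamiliar with Shapley values, the sketch below computes them exactly by enumerating every feature subset, which is feasible only for a handful of features. It is a generic illustration of the definition, not the procedure used in the study, which would normally rely on a dedicated library for tree models. The additive `toy_value` function and its contribution numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating all subsets of the other features."""
    n = len(features)
    result = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                total += weight * gain
        result[feature] = total
    return result


# Hypothetical per-feature contributions to a TOC prediction (arbitrary units).
contrib = {"COD": 3.0, "TP": 1.0, "TN": 0.5}


def toy_value(subset):
    """Toy model output when only the features in `subset` are known."""
    return 2.0 + sum(contrib[name] for name in subset)


phi = shapley_values(toy_value, list(contrib))
print(phi)
```

Because the toy model is additive, each Shapley value equals that feature's own contribution, and the values sum to the gap between the full prediction and the baseline. Real models with interactions split shared effects across the features involved.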
•ML models for TOC prediction in river sub-basins with geospatial parameters.
•The integration of geospatial data significantly enhances TOC prediction accuracy.
•XGBoost model shows superior performance over traditional statistical model, MLR.
•Agricultural land enhances TOC prediction more during rain events than non-rain.