Understanding the mechanisms of pollutant removal in Wastewater Treatment Plants (WWTPs) is crucial for controlling effluent quality efficiently. However, the numerous treatment units, operational factors, and the underlying interactions between these units and factors usually obscure a comprehensive and precise understanding of the processes. We have previously proposed a machine learning (ML) framework to uncover complex cause-and-effect relationships in WWTPs. However, only one interpretable ML model, random forest (RF), was studied, and the interpretation method was not granular enough to reveal detailed relationships between operational factors and effluent parameters. Thus, in this paper, we present an upgraded framework involving three interpretable tree-based models (RF, XGBoost, and LightGBM), three metrics (R2, root mean squared error (RMSE), and mean absolute error (MAE)), and a more advanced interpretation system, SHapley Additive exPlanations (SHAP). Details of the framework are provided along with a demonstration of its practical applicability based on a case study of the Umeå WWTP in Sweden. Results show that, for both labels TSSe (total suspended solids in effluent) and PO4e (phosphate in effluent), the XGBoost models are optimal whereas the RF models are the least optimal, due to overfitting and polarized fitting. This study has yielded multiple new and significant findings with respect to the control of TSSe and PO4e in the Umeå WWTP and other similarly configured WWTPs. Additionally, this study has produced two important generic findings relating to ML applications for WWTPs (or even other process industries) in terms of cause-and-effect investigations. First, the model comparison should be carried out from multiple perspectives to ensure that underlying details are fully revealed and examined. Second, using a precise, robust, and granular (feature attribution available for individual instances) explanation method can bring extra insight into both model comparison and model interpretation. SHAP is recommended as we found it to be of great value in this study.
•Machine learning is applied to WWTP process to uncover detailed cause-and-effect information.
•Multiple tree-based models are examined to select the optimal one for interpretation.
•Multiple metrics are used to evaluate models' performances comprehensively.
•For the first time SHapley Additive exPlanations is used for WWTP process analytics.
•Results can help to develop advanced process control strategies.
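The three evaluation metrics named in this abstract (R2, RMSE, and MAE) can be illustrated with a minimal sketch. The function name `regression_metrics`, the observed values, and the two candidate prediction lists are hypothetical and are not taken from the study; they only show how scoring competing models on several metrics at once might look.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for two equal-length numeric sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares used by the coefficient of determination.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Hypothetical effluent measurements and predictions from two candidate models.
observed = [4.0, 6.0, 5.0, 9.0]
candidates = {
    "model_a": [4.5, 5.5, 5.0, 8.0],
    "model_b": [5.0, 5.0, 5.0, 5.0],
}
scores = {name: regression_metrics(observed, pred) for name, pred in candidates.items()}
for name, result in scores.items():
    print(name, result)
```

RMSE penalizes large errors more strongly than MAE, while R2 relates the error to the variance of the observations, so the three metrics can rank models differently.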
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
•Transfer learning allows for data-efficient molecular property prediction.
•SHAP analysis proves the physical principles incorporated in machine learning.
•Group additive predictions are excellent for transfer learning pretraining.
•Transfer learning holds great potential to improve various group additive models.
The accuracy of thermochemical prediction methods is strongly dependent on the size of the set of training data. Group additivity is an interpretable modeling strategy that can be developed from a limited dataset, but fails to consider delocalized molecular effects such as inductive stabilization, delocalized resonance stabilization, and steric effects. In contrast, machine learning allows the incorporation of these effects but requires an extensive amount of high-quality data. Therefore, a new transfer learning approach is proposed, uniting group additivity with machine learning. First, a machine learning model is pretrained on a large set of group additive predictions, after which it is refined on a limited high-quality dataset with transfer learning. The proposed approach was tested to predict the standard enthalpy of formation, standard molar entropy, and heat capacity of a wide range of hydrocarbons, hydrocarbon radicals, and carbenium ions. By using transfer learning, chemically accurate predictions for hydrocarbons, radicals, and carbenium ions could be obtained, drastically reducing the group additive error using less than 450 molecular datapoints per model. A SHapley Additive exPlanations analysis reveals that a data-efficient but interpretable transfer learning methodology is obtained, achieving chemically accurate predictions for a wide range of hydrocarbons.
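The pretrain-then-refine idea described in this abstract can be sketched on a toy problem. Everything below is an assumption made for illustration: a one-dimensional linear model stands in for the machine learning model (the abstract does not specify its architecture), the "cheap" dataset plays the role of the group additive predictions, and the small "accurate" dataset plays the role of the high-quality data.

```python
def fit_linear(data, w=0.0, b=0.0, lr=0.1, steps=500):
    """Gradient descent on mean squared error for y ~ w*x + b, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, w, b):
    """Mean squared error of the linear model on a dataset of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical large but slightly biased dataset (stand-in for group additive predictions).
cheap = [(i / 49, 2.0 * (i / 49) + 1.0) for i in range(50)]
# Hypothetical small, accurate dataset (stand-in for the high-quality reference data).
accurate = [(x, 2.0 * x + 1.5) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Stage 1: pretrain on the large approximate dataset.
w_pre, b_pre = fit_linear(cheap)
# Stage 2: refine on the small accurate dataset, starting from the pretrained weights.
w_ft, b_ft = fit_linear(accurate, w=w_pre, b=b_pre, steps=200)

print("error before refinement:", mse(accurate, w_pre, b_pre))
print("error after refinement: ", mse(accurate, w_ft, b_ft))
```

The refinement stage starts from the pretrained parameters rather than from scratch, which is what lets a handful of accurate points correct the systematic bias of the cheap data.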
PV power forecasting models are predominantly based on machine learning algorithms which do not provide any insight into or explanation about their predictions (black boxes). Therefore, their direct implementation in environments where transparency is required, and the trust associated with their predictions, may be questioned. To this end, we propose a two-stage probabilistic forecasting framework able to generate highly accurate, reliable, and sharp forecasts while offering full transparency on both the point forecasts and the prediction intervals (PIs). In the first stage, we exploit natural gradient boosting (NGBoost) to yield probabilistic forecasts, while in the second stage, we calculate the Shapley additive explanation (SHAP) values in order to fully comprehend why a prediction was made. To highlight the performance and the applicability of the proposed framework, real data from two PV parks located in Southern Germany are employed. Comparative results with two state-of-the-art algorithms, namely Gaussian process and lower upper bound estimation, show a significant increase in the point forecast accuracy and in the overall probabilistic performance. Most importantly, a detailed analysis of the model’s complex nonlinear relationships and interaction effects between the various features is presented. This allows interpreting the model, identifying some learned physical properties, explaining individual predictions, reducing the computational requirements for the training without jeopardizing the model accuracy, detecting possible bugs, and gaining trust in the model. Finally, we conclude that the model was able to develop complex nonlinear relationships which follow known physical properties as well as human logic and intuition.
•A probabilistic forecasting model using natural gradient boosting is proposed.
•SHAP values are deployed to interpret both point forecasts and prediction intervals.
•Nonlinear physics based interactions between features were learned.
Radionuclide diffusion is influenced by numerous factors. Establishing a model that can elucidate the internal correlation between mesoscopic diffusion and the microscopic structure of bentonite can enhance the comprehension of radionuclide diffusion mechanisms. In this study, a light gradient boosting machine (LightGBM) was employed to predict the effective diffusion coefficients of HCrO4−, I−, and CoEDTA2− in bentonite. The model's hyperparameters were optimized using the particle swarm optimization (PSO) algorithm. Several correlated physical quantities, such as mesoscopic parameters (total porosity, rock capacity factor, and ion molar conductivity) and microscopic parameters (ionic radius and montmorillonite stacking number), were incorporated to develop a machine learning model that combined micro- and meso-scale features. The predictive performance of PSO-LightGBM was verified using diffusion experiments, which investigated the diffusion of HCrO4−, I−, and CoEDTA2− at compacted dry densities of 1200–1800 kg/m3 using a through-diffusion method. Spearman correlation and Shapley additive explanation analyses revealed that the compacted dry density, ionic diffusion coefficient in water, ionic radius, and total porosity were the top four influencing factors among the 16 input features. Partial dependence plot analysis elucidated the relationship between the effective diffusion coefficient and each input feature. The analysis results were consistent with the experimental findings, demonstrating the reliability of machine learning. Due to the incorporation of multi-scale features, the PSO-LightGBM model demonstrated enhanced predictive accuracy, linking the microstructure of bentonite to radionuclide diffusion, and providing a comprehensive interpretation of the diffusion mechanism.
•The diffusion of HCrO4−, I−, and CoEDTA2− in bentonite was investigated.
•PSO-LightGBM was used to predict the effective diffusion coefficient.
•The mesoscopic diffusion was related to the microscopic structure of bentonite.
•Critical factors and their correlations to the diffusion were unveiled.
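The hyperparameter search by particle swarm optimization mentioned in this abstract can be sketched in a few lines. This is a generic, minimal PSO under stated assumptions, not the authors' implementation: the objective `toy_validation_loss` is a hypothetical stand-in for "train the model with these hyperparameters and return its validation error", and the two search dimensions, their bounds, and the swarm coefficients are illustrative choices.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=40, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize objective over a box given as a list of (low, high) bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    history = [gbest_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def toy_validation_loss(params):
    """Hypothetical smooth loss with its minimum at (0.1, 0.6)."""
    learning_rate, subsample = params
    return (learning_rate - 0.1) ** 2 + (subsample - 0.6) ** 2


search_bounds = [(0.0, 1.0), (0.0, 1.0)]
best_params, best_loss, loss_history = pso_minimize(toy_validation_loss, search_bounds)
print(best_params, best_loss)
```

In a real tuning run each objective evaluation would involve fitting a model, so the number of particles and iterations directly sets the training budget.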
RC shear walls are commonly used as lateral load-resisting elements in seismic regions, and the estimation of their shear strengths can become simultaneously design-critical and complex when they have so-called squat geometries, i.e., height-to-length ratios less than two. This paper presents a study on the training and interpretation of an advanced machine-learning model that strategically combines two algorithms for this purpose. To train the model, a comprehensive shear strength database of 434 samples of squat RC walls is utilized. First, the eXtreme Gradient Boosting (XGBoost) algorithm is used to establish a predictive model for estimating the shear strength, wherein 70% and 30% of the data are respectively used for training and validation. This effort resulted in an approximately 97% validation accuracy, which well exceeds that of current mechanics-based/semiempirical models. Second, the SHapley Additive exPlanations (SHAP) algorithm is used to estimate the relative importance of the factors affecting XGBoost’s shear strength estimates. This step thus enabled physical and quantitative interpretations of the input-output dependencies, which are nominally hidden in conventional machine-learning approaches. Through this setup, several squat wall attributes are identified as being critical in shear strength estimates.
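The 70%/30% partition of the 434-sample database mentioned in this abstract could be produced along the following lines. This is only a sketch: the abstract does not say whether the split was random or which seed was used, so the shuffling, the seed, and the function name `split_train_validation` are assumptions.

```python
import random


def split_train_validation(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and split it into training and validation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical sample identifiers standing in for the 434 database entries.
sample_ids = list(range(434))
train_ids, validation_ids = split_train_validation(sample_ids)
print(len(train_ids), len(validation_ids))
```

Keeping the validation part untouched during training is what makes the reported validation score an estimate of performance on unseen walls.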
The accurate identification of spatial drivers is crucial for effectively managing soil heavy metals (SHM). However, understanding the complex and diverse spatial drivers of SHM and their interactive effects remains a significant challenge. In this study, we present a comprehensive analysis framework that integrates Geodetector, CatBoost, and SHapley Additive exPlanations (SHAP) techniques to identify and elucidate the interactive effects of spatial drivers in SHM within the Pearl River Delta (PRD) region of China. Our investigation incorporated fourteen environmental factors and focused on the pollution levels of three prominent heavy metals: Hg, Cd, and Zn. These findings provide several key insights: (1) The distribution of SHM is influenced by the combined effects of various individual factors and interactions within the source–flow–sink process. (2) Compared with the spatial interpretation of individual factors, the interaction between Hg and Cd exhibited enhanced spatial explanatory power. Similarly, interactions involving Zn mainly demonstrated increased spatial explanatory power, but there was one exception in which a weakening was observed. (3) Spatial heterogeneity plays a crucial role in determining the contributions of environmental factors to soil heavy metal concentrations. Although individual factors generally promote metal accumulation, their effects fluctuate when interactions are considered. (4) The SHAP interpretable method effectively addresses the limitations associated with machine-learning models by providing understandable insights into heavy metal pollution. This enables a comparison of the importance of environmental factors and elucidates their directional impacts, thereby aiding in the understanding of interaction mechanisms. The methods and findings presented in this study offer valuable insights into the spatial heterogeneity of heavy metal pollution in soil. By focusing on the effects of interactive factors, we aimed to develop more accurate strategies for managing SHM pollution.
•Geodetector found that spatial drivers of heavy metals interact to enhance explanatory power for spatial heterogeneity.
•CatBoost effectively explained the relationship between three heavy metals and multiple environmental variables.
•CatBoost-SHAP revealed the range, boundaries, and thresholds of key driving factors' interaction effects.
In this paper, we address the problem of the interpretability of a machine learning model designed to predict air quality time series. When constructing a forecasting model, in addition to obtaining good accuracy, it is very important to understand why each prediction is made. Usually, interpreting the output of machine learning models is considered to be very difficult due to their complex “black box” architecture. However, we show how Shapley additive explanations can be used to interpret the outputs of a deep neural network designed to predict nitrogen dioxide concentrations in Madrid. This method computes an estimate of the contribution of each feature to a particular prediction. Furthermore, we compare three explanatory methods to determine which one is more suitable for the air quality data and for the chosen machine learning model. A deeper insight into how the model behaves when predicting the pollution time series is obtained.
•A deep learning model is used to predict NO2 concentrations in the atmosphere of Madrid.
•The resulting model is complex and, thus, hard to interpret.
•Using the SHAP framework an explanation is derived for each prediction.
•These explanations allow a deeper insight into how the model behaves in each case.
•Results support previous intuitions about the relation of the meteorological variables with NO2.
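The per-prediction feature contributions that this abstract refers to are Shapley values. A minimal sketch of the exact computation is given below, assuming that an "absent" feature is replaced by a fixed baseline value. This is not the SHAP library and not the approximation the authors used: it enumerates every feature subset, so it is only practical for a handful of features. The model `toy_model`, the instance, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction of model."""
    n = len(instance)

    def value(present):
        # Features in `present` take the instance value, all others the baseline value.
        x = [instance[i] if i in present else baseline[i] for i in range(n)]
        return model(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi


def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
contributions = shapley_values(toy_model, instance, baseline)
print(contributions)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features that produce it.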
•Explainable landslide prediction was used for the first time using time-series RS data.
•SHAP was used to understand the black box of decisions that ML-based models make.
•36 features derived from ALOS-PALSAR, ALOS-2 (SAR), Landsat-8, topo. maps and DEM.
•Models were tested on 269 landslide locations in Chukha, Bhutan as a test site.
•SHAP plots were developed to assess predictor interactions over RF and SVM.
•XAI could measure the impact, interaction and correlation of factors within a model.
As artificial intelligence (AI) techniques are becoming more popular in landslide modeling, it is important to understand how decisions are made. Fairness and transparency become ever more vital due to ethical concerns and trust. Despite the popularity of machine learning (ML) algorithms in landslide modeling, these methods are often regarded as black boxes. This paper aims to propose an explainable artificial intelligence (XAI) approach for landslide prediction using synthetic-aperture radar (SAR) time-series data, NDVI (normalized difference vegetation index) time-series data, and other geo-environmental factors such as DEM (digital elevation model) derivatives. We employed a Shapley Additive Explanations (SHAP) approach to understand how and what decisions ML-based models are making. 37 features were extracted from various sources such as ALOS-PALSAR (ALOS Phased Array type L-band Synthetic Aperture Radar), ALOS-2 (SAR), Landsat-8, topographic maps, and DEM for landslide susceptibility mapping in a landslide-prone area in Chukha, Bhutan as a test site. The results were then compared using two standard ML methods: random forest (RF) and support vector machine (SVM). According to the results, the RF model outperformed (0.914) the SVM. Moreover, the higher reliability of the RF model was shown by the area under the curve (AUC) of 0.941. XAI results revealed that features such as altitude, aspect, NDVI-2014, NDVI-2017, and NDVI-2018 were the most effective features for landslide prediction by both models. Interestingly, among those features, NDVI-2014, aspect, and NDVI-2017 were negatively correlated with the landslide prediction, whereas they were positively correlated when SVM was utilized. This interpretation ability indicates the advantages of XAI over conventional methods, as it measures the impact, interaction, and correlation of conditioning factors within a model. The current research findings can provide more transparency and explainability when working with ML in landslide studies. This could help to build trust among geoscientists and decision-makers when making geohazard predictions.
A systematic understanding of the spatial distribution of water quality is critical for successful watershed management; however, the limited number of physical monitoring stations has restricted the evaluation of spatial water quality distribution and the identification of features impacting the water quality. To fill this gap, we developed a modeling process that employed the random forest regression (RFR) to model the water quality distribution for the Taihu Lake basin in Zhejiang Province, China, and adopted the Shapley Additive exPlanations (SHAP) method to interpret the underlying driving forces. We first used RFR to model three water quality parameters: permanganate index (CODMn), total phosphorus (TP), and total nitrogen (TN), based on 16 watershed features. We then applied the built models to generate water quality distribution maps for the basin, with the CODMn ranging from 1.39 to 6.40 mg/L, TP from 0.02 to 0.23 mg/L, and TN from 1.43 to 4.27 mg/L. These maps showed generally consistent patterns among the CODMn, TN, and TP with minor differences in the spatial distribution. The SHAP analysis showed that the TN was mainly affected by agricultural non-point sources, while the CODMn and TP were affected by agricultural and domestic sources. Due to differences in sewage collection and treatment between urban and rural areas, the water quality in highly populated urban areas was better than that in rural areas, which led to an unexpected positive relationship between water quality and population density. Overall, with the RFR models and SHAP interpretation, we obtained a continuous distribution pattern of the water quality and identified its driving forces in the basin. These findings provided important information to assist water quality restoration projects.
•Random forest regression (RFR) was effective in water quality prediction.
•Driving forces for water quality were identified by the SHAP method.
•Continuous distribution patterns of water quality were obtained for the basin.
•Intensified agriculture was the main cause of poor water quality in the basin.
•Water quality was not closely related to the population density.