Most environmental data come from a minority of well‐monitored sites. An ongoing challenge in the environmental sciences is transferring knowledge from monitored sites to unmonitored sites. Here, we demonstrate a novel transfer‐learning framework that accurately predicts depth‐specific temperature in unmonitored lakes (targets) by borrowing models from well‐monitored lakes (sources). This method, meta‐transfer learning (MTL), builds a meta‐learning model to predict transfer performance from candidate source models to targets using lake attributes and candidates' past performance. We constructed source models at 145 well‐monitored lakes using calibrated process‐based (PB) modeling and a recently developed approach called process‐guided deep learning (PGDL). We applied MTL to either PB or PGDL source models (PB‐MTL or PGDL‐MTL, respectively) to predict temperatures in 305 target lakes treated as unmonitored in the Upper Midwestern United States. We show significantly improved performance relative to the uncalibrated PB General Lake Model, where the median root mean squared error (RMSE) for the target lakes is 2.52°C. PB‐MTL yielded a median RMSE of 2.43°C; PGDL‐MTL yielded 2.16°C; and a PGDL‐MTL ensemble of nine sources per target yielded 1.88°C. For sparsely monitored target lakes, PGDL‐MTL often outperformed PGDL models trained on the target lakes themselves. Differences in maximum depth between the source and target were consistently the most important predictors. Our approach readily scales to thousands of lakes in the Midwestern United States, demonstrating that MTL with meaningful predictor variables and high‐quality source models is a promising approach for many kinds of unmonitored systems and environmental variables.
Key Points
Meta‐transfer learning (MTL) learns from models trained on data‐rich systems to inform predictions in systems where no observations exist
We use MTL with process‐based and process‐guided deep learning models to accurately predict lake temperatures in the Midwest United States
The most important predictor of transfer model success is the difference in maximum depth between the data‐rich and unmonitored lake
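The source-selection step of MTL described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the meta-model is reduced to a least-squares line on the absolute difference in maximum depth (the abstract does not state which meta-learner was used), and the lake names, depths, past-transfer errors, and toy source models are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; a simple stand-in for the meta-model."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def rank_sources(target_depth, source_depths, meta_model, k):
    """Return the k sources with the lowest predicted transfer RMSE."""
    a, b = meta_model
    predicted = {name: a + b * abs(depth - target_depth)
                 for name, depth in source_depths.items()}
    return sorted(predicted, key=predicted.get)[:k]


def ensemble_predict(chosen, source_models, air_temp):
    """Average the predictions of the chosen source models."""
    return mean(source_models[name](air_temp) for name in chosen)


# Hypothetical past transfers between monitored lakes:
# (absolute maximum-depth difference in m, observed transfer RMSE in degC).
past_transfers = [(0.0, 1.5), (2.0, 1.9), (5.0, 2.5), (10.0, 3.5)]
meta_model = fit_line([d for d, _ in past_transfers],
                      [r for _, r in past_transfers])

# Hypothetical source lakes: maximum depth (m) and a toy temperature model.
source_depths = {"lake_a": 4.0, "lake_b": 12.0, "lake_c": 25.0}
source_models = {
    "lake_a": lambda air_temp: 0.75 * air_temp + 2.0,
    "lake_b": lambda air_temp: 0.5 * air_temp + 4.0,
    "lake_c": lambda air_temp: 0.25 * air_temp + 6.0,
}

# Pick the two most promising sources for an unmonitored 10 m deep target.
chosen = rank_sources(10.0, source_depths, meta_model, k=2)
prediction = ensemble_predict(chosen, source_models, air_temp=20.0)
print(chosen, prediction)
```

The point of the design is that the meta-model is trained only on transfers between monitored lakes, so no observations from the target are needed to rank the candidates.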
The rapid growth of data in water resources has created new opportunities to accelerate knowledge discovery with the use of advanced deep learning tools. Hybrid models that integrate theory with state‐of‐the‐art empirical techniques have the potential to improve predictions while remaining true to physical laws. This paper evaluates the Process‐Guided Deep Learning (PGDL) hybrid modeling framework with a use‐case of predicting depth‐specific lake water temperatures. The PGDL model has three primary components: a deep learning model with temporal awareness (long short‐term memory recurrence), theory‐based feedback (model penalties for violating conservation of energy), and model pretraining to initialize the network with synthetic data (water temperature predictions from a process‐based model). In situ water temperatures were used to train the PGDL model, a deep learning (DL) model, and a process‐based (PB) model. Model performance was evaluated in various conditions, including when training data were sparse and when predictions were made outside of the range in the training data set. The PGDL model performance (as measured by root‐mean‐square error (RMSE)) was superior to DL and PB for two detailed study lakes, but only when pretraining data included greater variability than the training period. The PGDL model also performed well when extended to 68 lakes, with a median RMSE of 1.65 °C during the test period (DL: 1.78 °C, PB: 2.03 °C; in a small number of lakes PB or DL models were more accurate). This case‐study demonstrates that integrating scientific knowledge into deep learning tools shows promise for improving predictions of many important environmental variables.
Key Points
Process‐Guided Deep Learning (PGDL) models integrate advanced empirical techniques with process knowledge
We used PGDL to accurately predict lake water temperatures for various conditions
PGDL performance improved significantly when pretraining data included diverse conditions generated by an existing process‐based model
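The theory‐based feedback component can be sketched as a loss that adds a conservation‐of‐energy penalty to the usual prediction error. This is only an illustration under stated assumptions: the abstract says the model is penalized for violating conservation of energy but does not give the functional form, so the thresholded absolute imbalance, the `weight` and `tolerance` parameters, and all numbers below are hypothetical.

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def energy_penalty(energy_change, net_flux, tolerance):
    """Mean amount by which |energy change - net flux| exceeds a tolerance.

    The penalty is zero whenever the energy budget closes to within the
    tolerance (assumed form, not taken from the paper).
    """
    excess = [max(0.0, abs(de - f) - tolerance)
              for de, f in zip(energy_change, net_flux)]
    return sum(excess) / len(excess)


def physics_guided_loss(pred, obs, energy_change, net_flux,
                        weight=0.1, tolerance=1.0):
    """Data misfit plus a weighted physical-consistency penalty."""
    return rmse(pred, obs) + weight * energy_penalty(
        energy_change, net_flux, tolerance)


# Hypothetical values for two time steps.
pred = [10.0, 12.0]           # predicted temperatures (degC)
obs = [11.0, 12.0]            # observed temperatures (degC)
energy_change = [5.0, 3.0]    # change in lake energy implied by predictions
net_flux = [5.5, 7.0]         # net energy flux from the drivers

penalty = energy_penalty(energy_change, net_flux, tolerance=1.0)
loss = physics_guided_loss(pred, obs, energy_change, net_flux)
print(penalty, loss)
```

Because the penalty depends only on the predictions and the drivers, it can in principle be evaluated on time steps that have no temperature observations.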
The global decline of water quality in rivers and streams has resulted in a pressing need to design new watershed management strategies. Water quality can be affected by multiple stressors including population growth, land use change, global warming, and extreme events, with repercussions on human and ecosystem health. A scientific understanding of factors affecting riverine water quality and predictions at local to regional scales and at sub‐daily to decadal timescales are needed for optimal management of watersheds and river basins. Here, we discuss how machine learning (ML) can enable development of more accurate, computationally tractable, and scalable models for analysis and predictions of river water quality. We review relevant state‐of‐the‐art applications of ML for water quality models and discuss opportunities to improve the use of ML with emerging computational and mathematical methods for model selection, hyperparameter optimization, incorporating process knowledge into ML models, improving explainability, uncertainty quantification, and model‐data integration. We then present considerations for using ML to address water quality problems given their scale and complexity, available data and computational resources, and stakeholder needs. When combined with decades of process understanding, interdisciplinary advances in knowledge‐guided ML, information theory, data integration, and analytics can help address fundamental science questions and enable decision‐relevant predictions of riverine water quality.
Machine learning (ML) is being increasingly used for hydrological applications and has the potential to improve predictive capabilities and decipher complex, diverse human‐natural processes impacting water quality. In this paper, we review relevant state‐of‐the‐art models and present considerations for using ML, along with its limitations, when applied to water quality problems. We then discuss opportunities to improve ML models using emerging computational and mathematical methods for model selection, hyperparameter optimization, incorporating process knowledge and complex data, explainable AI, uncertainty quantification, and model‐data integration.
Stream temperature (Ts) is an important water quality parameter that affects ecosystem health and human water use for beneficial purposes. Accurate Ts predictions at different spatial and temporal scales can inform water management decisions that account for the effects of changing climate and extreme events. In particular, widespread predictions of Ts in unmonitored stream reaches can enable decision makers to be responsive to changes caused by unforeseen disturbances. In this study, we demonstrate the use of classical machine learning (ML) models, support vector regression and gradient boosted trees (XGBoost), for monthly Ts predictions in 78 pristine and human-impacted catchments of the Mid-Atlantic and Pacific Northwest hydrologic regions spanning different geologies, climate, and land use. The ML models were trained using long-term monitoring data from 1980–2020 for three scenarios: (1) temporal predictions at a single site, (2) temporal predictions for multiple sites within a region, and (3) spatiotemporal predictions in unmonitored basins (PUB). In the first two scenarios, the ML models predicted Ts with median root mean squared errors (RMSE) of 0.69–0.84 °C and 0.92–1.02 °C across different model types for the temporal predictions at single and multiple sites, respectively. For the PUB scenario, we used a bootstrap aggregation approach using models trained with different subsets of data, for which an ensemble XGBoost implementation outperformed all other modeling configurations (median RMSE 0.62 °C). The ML models improved median monthly Ts estimates compared to baseline statistical multi-linear regression models by 15–48% depending on the site and scenario. Air temperature was found to be the primary driver of monthly Ts for all sites, with secondary influence of month of the year (seasonality) and solar radiation, while discharge was a significant predictor at only 10 sites.
The predictive performance of the ML models was robust to configuration changes in model setup and inputs, but was influenced by the distance to the nearest dam, with RMSE <1 °C at sites more than 16 and 44 km from a dam for the temporal single-site and regional scenarios, respectively, and more than 1.4 km from a dam for the PUB scenario. Our results show that classical ML models with solely meteorological inputs can be used for spatial and temporal predictions of monthly Ts in pristine and managed basins with reasonable (<1 °C) accuracy for most locations.
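The bootstrap aggregation idea used for the PUB scenario can be sketched as follows. This is a simplified illustration, not the study's configuration: a one-variable least-squares line on air temperature stands in for XGBoost, and the monthly values, the number of ensemble members, and the seed are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x (stand-in for the real learner)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample with a single x value: predict the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bagged_models(xs, ys, n_models, seed=0):
    """Fit one model per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all ensemble members."""
    return mean(a + b * x for a, b in models)


# Hypothetical monthly mean air and stream temperatures (degC).
air_temp = [2.0, 6.0, 11.0, 16.0, 21.0, 24.0]
stream_temp = [3.1, 5.2, 8.4, 11.9, 15.2, 17.1]

models = bagged_models(air_temp, stream_temp, n_models=25)
estimate = bagged_predict(models, 14.0)
print(round(estimate, 2))
```

Averaging over resampled fits reduces the sensitivity of the prediction to any single subset of training data, which is the property an ensemble relies on when the prediction site contributed no data of its own.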
The dataset described here includes estimates of historical (1980–2020) daily surface water temperature, lake metadata, and daily weather conditions for lakes larger than 4 ha in the conterminous United States (n = 185,549), as well as in situ temperature observations for a subset of lakes (n = 12,227). Estimates were generated using a long short‐term memory deep learning model and compared to existing process‐based and linear regression models. Model training was optimized for prediction on unmonitored lakes through cross‐validation that held out lakes to assess generalizability and estimate error. On the held‐out lakes with in situ observations, median lake‐specific error was 1.24°C, and the overall root mean squared error was 1.61°C. This dataset increases the number of lakes with daily temperature predictions relative to existing datasets and substantially improves predictive accuracy compared to a prior empirical model and a debiased process‐based approach (2.01°C and 1.79°C median error, respectively).
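The hold‐out scheme and the two error summaries reported above can be illustrated with a small sketch. It assumes that "lake‐specific error" means a per‐lake RMSE, which the abstract does not state explicitly; the round‐robin fold assignment, lake identifiers, and residuals are hypothetical.

```python
import math
from statistics import median


def lake_folds(lake_ids, n_folds):
    """Assign whole lakes, not individual observations, to folds."""
    unique = sorted(set(lake_ids))
    return {lake: i % n_folds for i, lake in enumerate(unique)}


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def summarize(errors_by_lake):
    """Return (median per-lake RMSE, RMSE pooled over all observations)."""
    per_lake = [rmse(errs) for errs in errors_by_lake.values()]
    pooled = [e for errs in errors_by_lake.values() for e in errs]
    return median(per_lake), rmse(pooled)


# One lake ID per observation; every observation of a lake shares its fold.
observation_lakes = ["a", "a", "b", "c", "c", "d"]
folds = lake_folds(observation_lakes, n_folds=2)

# Hypothetical residuals (predicted minus observed, degC) on held-out lakes.
errors_by_lake = {
    "lake_1": [1.0, -1.0],
    "lake_2": [2.0, -2.0, 2.0, -2.0],
    "lake_3": [0.5, -0.5],
}
median_lake_rmse, overall_rmse = summarize(errors_by_lake)
print(folds, median_lake_rmse, overall_rmse)
```

Holding out entire lakes means the test error reflects performance on lakes the model never saw, and the pooled RMSE can exceed the median per‐lake value when heavily observed lakes have larger errors.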
This thesis provides a computer science audience with a review of recently published machine learning techniques for modeling time series in unmonitored environmental systems, where no target data are available, and further includes three distinct research efforts applying these methods to real-world water resources prediction scenarios. Additionally, we identify several open questions for time series prediction in unmonitored sites, including how to incorporate dynamic inputs and site characteristics, mechanistic understanding, and explainable AI techniques in modern machine learning frameworks. This work is motivated by the vast increase in applications of machine learning models to environmental time series, in particular deep learning models built using the growing availability of high performance computing resources. It remains difficult to predict environmental variables for which observations are concentrated in a minority of locations while most locations remain unmonitored, and although many machine learning-based approaches have been developed, they are often not compared with one another. The increased attention to environmental prediction topics such as disaster response, water resources management, and climate change reveals a need to compare these approaches and to understand when and where they should be applied in unmonitored environmental prediction scenarios.