The H∞ control design problem is considered for nonlinear systems with an unknown internal system model. It is known that the nonlinear H∞ control problem can be transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation, a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, model-based approaches cannot be used to approximately solve the HJI equation when an accurate system model is unavailable or costly to obtain in practice. To overcome these difficulties, an off-policy reinforcement learning (RL) method is introduced to learn the solution of the HJI equation from real system data instead of a mathematical system model, and its convergence is proved. In the off-policy RL method, the system data can be generated with arbitrary policies rather than the policy being evaluated, which is extremely important and promising for practical systems. For implementation purposes, a neural network (NN)-based actor-critic structure is employed and a least-squares NN weight update algorithm is derived based on the method of weighted residuals. Finally, the developed NN-based off-policy RL method is tested on a linear F16 aircraft plant and further applied to a rotational/translational actuator system.
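For context, the HJI equation referenced here has a standard form in the nonlinear H∞ literature. The following is a generic statement for a system $\dot{x} = f(x) + g(x)u + k(x)w$ with penalty output $h(x)$ and attenuation level $\gamma$; the paper's exact notation and assumptions may differ.

```latex
% Nonlinear H-infinity setting (generic): \dot{x} = f(x) + g(x)u + k(x)w,
% with cost integrand h^{\top}(x)h(x) + u^{\top}u - \gamma^{2} w^{\top}w.
% A value function V(x) \ge 0 solves the HJI equation
0 = \nabla V^{\top}(x) f(x) + h^{\top}(x) h(x)
    - \tfrac{1}{4}\,\nabla V^{\top}(x)\, g(x) g^{\top}(x)\, \nabla V(x)
    + \tfrac{1}{4\gamma^{2}}\,\nabla V^{\top}(x)\, k(x) k^{\top}(x)\, \nabla V(x),
% yielding the saddle-point control and worst-case disturbance
u^{*}(x) = -\tfrac{1}{2}\, g^{\top}(x)\, \nabla V(x), \qquad
w^{*}(x) = \tfrac{1}{2\gamma^{2}}\, k^{\top}(x)\, \nabla V(x).
```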
Soft actor-critic (SAC) is an off-policy actor-critic (AC) reinforcement learning (RL) algorithm, essentially based on entropy regularization. SAC trains a policy by maximizing the trade-off between expected return and entropy (randomness in the policy). It has achieved state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. SAC works in an off-policy fashion: data are sampled uniformly from past experiences stored in a buffer and used to update the parameters of the policy and value function networks. We propose certain crucial modifications for boosting the performance of SAC and making it more sample efficient. In our proposed improved SAC (ISAC), we first introduce a new prioritization scheme for selecting better samples from the experience replay (ER) buffer. Second, we use a mixture of the prioritized off-policy data with the latest on-policy data for training the policy and value function networks. We compare our approach with vanilla SAC and some recent variants of SAC and show that our approach outperforms these algorithmic benchmarks, being comparatively more stable and sample efficient when tested on a number of continuous control tasks in MuJoCo environments.
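To make the two modifications concrete, here is a minimal sketch of the batch-construction idea: a replay buffer that fills part of each batch with the most recent (approximately on-policy) transitions and the rest with priority-weighted draws. The class name, priority exponent `alpha`, and mixing fraction `on_policy_frac` are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

class MixedPrioritizedBuffer:
    """Sketch of a replay buffer mixing prioritized and recent samples."""

    def __init__(self, capacity, alpha=0.6, on_policy_frac=0.25):
        self.capacity = capacity
        self.alpha = alpha                  # how strongly priorities skew sampling
        self.on_policy_frac = on_policy_frac
        self.data, self.priorities = [], []

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:  # FIFO eviction when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        n_recent = int(batch_size * self.on_policy_frac)
        n_prio = batch_size - n_recent
        # The latest transitions approximate on-policy data.
        recent = self.data[-n_recent:] if n_recent > 0 else []
        # Remaining slots are drawn with probability proportional to priority^alpha.
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=n_prio, p=p)
        return recent + [self.data[i] for i in idx]
```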
Research often explores the role of scientific expertise in policymaking from an externalised perspective, mostly focusing on how policymakers use and abuse scientific expertise through political learning. However, very little is known about political learning by scientific experts. What strategies do they use to maintain and advance their access to, and influence on, policymaking? Using process tracing, we illustrate how scientific experts' access to policymaking is challenged as a policy issue develops. We explore how this nudges scientific experts to engage in political learning and employ political advocacy strategies to enhance science's role in policymaking, corresponding to evolving political opportunity structures. We empirically trace this using the case of EU climate policy development between 1990 and 2022. We identify three main sets of advocacy strategies used by scientific experts: narrative and semantic (policy issue-oriented), socialisation (actor-oriented), and governance (systems- and structures-oriented). In doing so, this article illustrates the political actorness and agency of scientific experts and provides a supplementary understanding of the role of science in public policy and policy change, not only as a function of policymakers' instrumentalization of science, but also as a function of how scientific experts actively advocate for science's role in public policy.
Wind speed forecasting (WSF) is a viable option for increasing energy consumption efficiency. Previous forecasting methods rely on global accuracy, and the performance of these models changes with each time step due to local variations in wind characteristics, which is not ideal. Considering this problem, a novel dynamic selection of the best model (DSM) approach using reinforcement learning (RL) is proposed, based on the on-policy state-action-reward-state-action (SARSA) algorithm, for improved wind speed forecasting. DSM is defined as an RL problem and solved with an on-policy SARSA agent. The proposed approach comprises a forecasting pool of models (FPM) and a learning agent. The FPM consists of five robust forecasting approaches that have been trained and tuned. These models perform the WSF individually, and the SARSA agent is developed to perform the DSM at each step. The proposed approach is evaluated for 1 h ahead (1HA) WSF using two real-time wind speed datasets from Garden City, Manhattan, and Idalia, Colorado. This study also provides a thorough comparison of the proposed approach's performance against an off-policy Q-learning algorithm for the DSM (QL-DSM). Compared to the FPM's models, the proposed SARSA-DSM approach enhanced prediction accuracy by 24.27% and 39.73% in two case studies. It also improves on QL-DSM by 14.57% and 30.25%.
•A novel dynamic selection approach is proposed using the on-policy SARSA algorithm.
•The proposed approach consists of a forecasting pool of models and a SARSA agent.
•The proposed SARSA-DSM approach is evaluated for 1HA wind speed forecasting.
•The SARSA-DSM approach enhanced accuracy by 24.27% and 39.73% in two case studies.
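A minimal sketch of the SARSA-DSM loop described above: at each step the agent selects one model from the forecasting pool and receives the negative absolute forecast error as its reward. The state encoding (the index of the previously best model) and all hyperparameters are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def sarsa_dsm(forecasts, actual, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """forecasts: (T, n_models) array of per-step predictions from the pool;
    actual: (T,) array of observed wind speeds. Returns the selected forecasts."""
    rng = np.random.default_rng(seed)
    T, n_models = forecasts.shape
    Q = np.zeros((n_models, n_models))   # state = index of previously best model
    state = 0
    action = int(rng.integers(n_models))
    selected = np.empty(T)
    for t in range(T):
        selected[t] = forecasts[t, action]
        # Smaller forecast error means a higher reward.
        reward = -abs(forecasts[t, action] - actual[t])
        next_state = int(np.argmin(np.abs(forecasts[t] - actual[t])))
        # Epsilon-greedy choice of the next action (on-policy behaviour).
        if rng.random() < eps:
            next_action = int(rng.integers(n_models))
        else:
            next_action = int(np.argmax(Q[next_state]))
        # SARSA update uses the action actually taken next (unlike Q-learning).
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action]
                                     - Q[state, action])
        state, action = next_state, next_action
    return selected
```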
Policy learning plays a critical role in crisis policymaking. Adequate learning can lead to effective crisis responses, while misdirected learning can derail policymaking and lead to policy fiascos, potentially with devastating effects. However, creeping crises such as the recent COVID-19 pandemic pose significant challenges for doing "good" policy learning. Such crises pose persistent threats to societal values or life-sustaining systems. They evolve across time and space while stirring significant political and societal tensions. Given their inherent features, they are often insufficiently addressed by policymakers. Taking the COVID-19 crisis as an illustrative example, this article aims to draw practitioners' attention to key features of creeping crises and explains how such crises can undermine critical policy learning processes. It then discusses the need for "policy learning governance" as an approach to design, administer, and manage crisis policy learning processes that can respond to continuous crisis evolution. In doing so, it helps practitioners engage in adaptive and agile policy learning processes toward more effective learning by introducing four key principles of policy learning governance during creeping crises: identifying optimum learning modes and types, learning across disciplines, learning across space, and learning across time. Practical tools distilled from emerging research are then introduced to help apply the proposed principles of policy learning governance during future crises.
Policy learning using historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements for consumers; choosing bids in contextual first-price auctions; and selecting medication based on patients' characteristics. However, the existing literature rests on the crucial assumption that the future environment in which the learned policy will be deployed is the same as the past environment that generated the data: an assumption that is often false, or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under a worst-case environment shift. We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that can learn a policy robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that is missing from standard policy learning algorithms. We conclude the paper with a comprehensive application of our methods to a real-world voting dataset.
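As a rough illustration of worst-case policy evaluation, the sketch below computes per-sample importance-weighted values from logged data and then lets an adversary reweight samples with likelihood ratios bounded in [1/gamma, gamma], a common uncertainty set; the paper's evaluation procedure and uncertainty set differ in their details, and all function and parameter names here are hypothetical.

```python
import numpy as np

def ipw_values(R, e, pi_probs):
    """Per-sample importance-weighted rewards: pi_probs is the probability
    the evaluated policy assigns to the logged action, e the logging
    propensity, R the observed reward."""
    return pi_probs / e * R

def worst_case_value(values, gamma=2.0):
    """Adversarial mean of `values` with weights w_i in [1/gamma, gamma]
    summing to n: remaining mass is pushed onto the smallest values first,
    which solves this linear program exactly."""
    v = np.sort(np.asarray(values, dtype=np.float64))   # worst outcomes first
    n = len(v)
    w = np.full(n, 1.0 / gamma)
    budget = n - w.sum()                 # mass left to distribute
    for i in range(n):
        extra = min(gamma - w[i], budget)
        w[i] += extra
        budget -= extra
        if budget <= 0:
            break
    return float(w @ v / n)
```

For gamma = 1 the adversary has no freedom and the estimate reduces to the ordinary IPW mean; larger gamma yields increasingly pessimistic (robust) value estimates.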
This paper proposes an off-policy learning-based dynamic state feedback protocol that achieves the optimal synchronization of heterogeneous multi-agent systems (MAS) over a directed communication network. Note that most recent works on heterogeneous MAS do not design the synchronization protocol in an optimal manner. By formulating the cooperative output regulation problem as an H∞ optimization problem, we can use reinforcement learning to find output synchronization protocols online along the system trajectories without solving the output regulator equations. In contrast to the existing optimal-control literature, where the leader's states are assumed to be globally or distributively available for communication, we only allow the relative system outputs to be transmitted through the network; that is, the leader's states are no longer needed for control or learning purposes.
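For reference, the cooperative output regulation setup typically takes the following generic form (standard notation; the paper's exact system classes and assumptions may differ):

```latex
% Heterogeneous followers i = 1,...,N and a leader (exosystem):
\dot{x}_i = A_i x_i + B_i u_i, \quad y_i = C_i x_i, \qquad
\dot{x}_0 = S x_0, \quad y_0 = R x_0.
% Synchronization (regulation) errors, to be driven to zero:
e_i = y_i - y_0 \;\longrightarrow\; 0 \quad \text{as } t \to \infty,
% with only relative output information available through the digraph:
\xi_i = \sum_{j \in \mathcal{N}_i} a_{ij}\,(y_i - y_j) + g_i\,(y_i - y_0).
```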
•Tentative forms of governance are an empirically relevant phenomenon in many fields of emerging sciences and technology.
•Tentative governance in practice is generally a matter of degree rather than a discrete phenomenon.
•Tentative governance often operates in combination with more definitive modes of governance.
•The mixture of tentative and definitive modes of governance involves a balancing act.
This conceptual introduction to the Special Section examines different modes of ‘tentative governance’ of Emerging Science and Technology (EST). The notion of tentative governance appears particularly relevant in the case of EST, given all the uncertainties and dynamics related to the scientific base, technologies, possible innovations, societal benefits and potential risks. While one may argue that such uncertainties are not peculiar to EST, it is nevertheless apparent that in industry, society and public policy the level of awareness of these uncertainties has increased, largely as a result of experiences with former emerging technologies (e.g. genetically modified organisms, nuclear technology). Governance is ‘tentative’ when public and private interventions are designed as a dynamic process that is prudent and preliminary rather than assertive and persistent. Tentative governance typically aims at creating spaces for probing and learning instead of stipulating definitive targets. The paper suggests a heuristic to position and relate the contributions to this Special Section. One main finding emerging from those contributions is that the inherent contingency of EST requires rather tentative approaches to governance, though often in combination with more definitive modes of governance, with the exact mixture involving a balancing act.
In a wide variety of applications, including healthcare, bidding in first price auctions, digital recommendations, and online education, it can be beneficial to learn a policy that assigns treatments to individuals based on their characteristics. The growing policy-learning literature focuses on settings in which policies are learned from historical data in which the treatment assignment rule is fixed throughout the data-collection period. However, adaptive data collection is becoming more common in practice from two primary sources: (1) data collected from adaptive experiments that are designed to improve inferential efficiency and (2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates the problem of learning an optimal policy ex post for two reasons: first, samples are dependent and, second, an adaptive assignment rule may not assign each treatment to each type of individual sufficiently often. In this paper, we address these challenges. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which nonuniformly reweight the elements of a standard AIPW estimator to control worst case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm’s effectiveness using both synthetic data and public benchmark data sets.
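A minimal sketch of a generalized AIPW policy-value estimator with nonuniform sample weights, in the spirit of the description above. The default weighting shown (inverse square root of the realized propensity, normalized) is a placeholder assumption; the paper derives variance-controlling weights.

```python
import numpy as np

def generalized_aipw(X, A, R, pi, mu_hat, e_hat, h=None):
    """X: contexts; A: logged actions; R: rewards; pi(x) -> action chosen
    by the policy under evaluation; mu_hat(x, a) -> outcome-model estimate;
    e_hat: realized assignment probabilities e_t(X_t, A_t) under the
    adaptive logging policy; h: optional per-sample weights summing to 1."""
    T = len(R)
    scores = np.empty(T)
    for t in range(T):
        a_pi = pi(X[t])
        dm = mu_hat(X[t], a_pi)                         # direct-model term
        # Propensity-weighted correction, active only when the logged
        # action matches the policy's action.
        correction = (A[t] == a_pi) / e_hat[t] * (R[t] - mu_hat(X[t], A[t]))
        scores[t] = dm + correction                     # per-sample AIPW score
    if h is None:
        h = 1.0 / np.sqrt(np.asarray(e_hat))            # placeholder weights
        h = h / h.sum()
    return float(h @ scores)
```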