In the field of Sequential Decision Making (SDM), two paradigms have historically vied for supremacy: Automated Planning (AP) and Reinforcement Learning (RL). In the spirit of reconciliation, this article reviews AP, RL and hybrid methods (e.g., novel learn-to-plan techniques) for solving Sequential Decision Processes (SDPs), focusing on their knowledge representation: symbolic, subsymbolic, or a combination. It also covers methods for learning the SDP structure. Finally, we compare the advantages and drawbacks of the existing methods and conclude that neurosymbolic AI is a promising approach for SDM, since it combines AP and RL with a hybrid knowledge representation.
Analyzing event-triggered control's (ETC) sampling behavior is of paramount importance, as it enables formal assessment of its sampling performance and prediction of its sampling patterns. In this work, we formally analyze the sampling behavior of stochastic linear periodic ETC (PETC) systems by computing bounds on associated metrics. Specifically, we consider functions over sequences of state measurements and intersampling times that can be expressed as average, multiplicative or cumulative rewards, and introduce their expectations as metrics on PETC's sampling behavior. We compute bounds on these expectations by constructing Interval Markov Chains, equipped with suitable reward functions, that abstract the sampling behavior of stochastic PETC systems. Our results are illustrated on a numerical example, for which we compute bounds on the expected average intersampling time and on the probability of triggering with the maximum possible intersampling time over a finite horizon.
We study distributional reachability for finite Markov decision processes (MDPs) from a control theoretical perspective. Unlike standard probabilistic reachability notions, which are defined over MDP states or trajectories, in this article reachability is formulated over the space of probability distributions. We propose two set-valued maps for the forward and backward distributional reachability problems: the forward map collects all state distributions that can be reached from a set of initial distributions, while the backward map collects all state distributions that can reach a set of final distributions. We show that there exists a maximal invariant set under the forward map and that this set is the region to which the state distributions eventually always belong, regardless of the initial state distribution and policy. The backward map provides an alternative way to solve a class of important problems for MDPs: the study of controlled invariance, the characterization of the domain of attraction, and reach-avoid problems. Three case studies illustrate the effectiveness of our approach.
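As a rough illustration of the forward distributional map described above, the sketch below propagates state distributions of a toy Markov chain induced by a fixed policy (the 3-state matrix is made up for illustration and is not from the paper):

```python
import numpy as np

# Toy 3-state MDP under a fixed policy: the induced Markov chain has
# row-stochastic transition matrix P. The forward map propagates a state
# distribution d_t to d_{t+1} = d_t @ P; collecting the iterates gives the
# distributions reachable from d0 under this policy.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])

def forward_reach(d0, P, steps):
    """Return the list [d_0, d_1, ..., d_steps] with d_{t+1} = d_t P."""
    dists = [np.asarray(d0, dtype=float)]
    for _ in range(steps):
        dists.append(dists[-1] @ P)
    return dists

d0 = np.array([1.0, 0.0, 0.0])        # start surely in state 0
trajectory = forward_reach(d0, P, 200)
# the iterates approach the chain's stationary distribution,
# consistent with the maximal invariant set intuition above
```

For an MDP proper, the reachable set at each step is the union of such iterates over all admissible policies; this sketch fixes one policy to keep the computation elementary.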
This paper aims to incorporate safety specifications into Markov decision processes. Explicitly, we address the minimization problem up to a stopping time with safety constraints. We establish a formalism based on the evolution equation to achieve our goal. We show how to compute the safety function with dynamic programming. In the last part of the paper, we develop several algorithms for safe stochastic optimization using linear and dynamic programming.
We consider scenarios where a swarm of unmanned vehicles (UxVs) seeks to satisfy a number of diverse, spatially distributed objectives. The UxVs strive to determine an efficient plan to service the objectives while operating in a coordinated fashion. We focus on developing autonomous high-level planning, where low-level controls are leveraged from previous work in distributed motion, target tracking, localization, and communication. We rely on the use of state and action abstractions in a Markov decision process framework to introduce a hierarchical algorithm, Dynamic Domain Reduction for Multi-Agent Planning, that enables multi-agent planning for large multi-objective environments. Our analysis establishes the correctness of our search procedure within specific subsets of the environments, termed 'sub-environments', and characterizes the algorithm's performance with respect to the optimal trajectories in single-agent and sequential multi-agent deployment scenarios using tools from submodularity. Simulation results show significant improvement over a standard Monte Carlo tree search in an environment with large state and action spaces.
We tackle the problem of finding optimal policies for Markov decision processes that minimize the probability of the cumulative cost exceeding a given budget. This task falls under the umbrella of Risk-Sensitive Markov Decision Processes, which optimize a non-additive, non-linear function of cumulative cost that incorporates the user's attitude towards risk. Current algorithms for solving this task, for any budget equal to or smaller than a user-defined budget, scale poorly when the support of the cost function is large, since they operate in an augmented state space that enumerates all possible remaining budgets. To circumvent this issue, we develop (i) an improved version of the Topological Value Iteration with Dynamic Programming algorithm (tvi-dp), and (ii) the first symbolic dynamic programming algorithm for this class of problems, called rs-spudd, which exploits conditional independence in the transition function in the augmented state space. The proposed algorithms improve efficiency by pruning irrelevant states and terminating early, without sacrificing optimality. Empirical results show that rs-spudd is able to solve problems up to 10^3 times larger than tvi-dp.
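The augmented-state formulation described above can be sketched with a minimal dynamic program (a toy 2-state, 2-action MDP with made-up integer costs, not the paper's benchmarks or its tvi-dp/rs-spudd algorithms):

```python
import numpy as np

# Tiny MDP: P[a, s, s'] is the transition probability under action a,
# and cost[s, a] is the (integer) step cost. Numbers are illustrative.
P = np.array([
    [[0.8, 0.2],
     [0.3, 0.7]],   # action 0
    [[0.5, 0.5],
     [0.9, 0.1]],   # action 1
])
cost = np.array([[1, 2],
                 [0, 1]])

def min_exceed_prob(s0, budget, horizon=4):
    """Min over policies of P(cumulative cost over `horizon` steps > budget),
    by dynamic programming on the augmented state (state, remaining budget).
    Enumerating remaining budgets is exactly what makes this scale poorly
    when the cost support is large."""
    budgets = range(-int(cost.max()) * horizon - 1, budget + 1)
    # terminal stage: the budget has been exceeded iff it went negative
    V = {(s, b): float(b < 0) for s in range(2) for b in budgets}
    for _ in range(horizon):
        newV = {}
        for s in range(2):
            for b in budgets:
                if b < 0:
                    newV[s, b] = 1.0   # already exceeded: absorbing
                else:
                    newV[s, b] = min(
                        sum(P[a, s, sp] * V[sp, max(b - cost[s, a], -1)]
                            for sp in range(2))
                        for a in range(2))
        V = newV
    return V[s0, budget]
```

A budget at least as large as the maximum achievable cost (here 2 per step over 4 steps) makes the exceed probability zero, and larger budgets never hurt; both facts are easy sanity checks on the recursion.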
This paper deals with the embedding framework of Markov decision processes (MDPs) with discrete state and action spaces to find optimal actions. The optimal control problem of MDPs can be efficiently tackled by restructuring it into an equivalent linearly-solvable Markov decision process (LMDP) through a method called embedding. However, state costs under the embedding may not exactly match the original costs and may even assume unrealistic values. In this work, we derive a constructive sufficient condition for an exact embedding solution that renders the embedded state cost equal to that of the original system. Furthermore, since the embedding in this case implies a transition from a discrete to a continuous action space, the correlation between the obtained continuous action and an equivalent desired discrete action is investigated using a maximum a posteriori probability-based method. Finally, some examples, including a mammalian cell-cycle network, are presented to demonstrate the effectiveness of the proposed method.
We consider a transmitter-receiver pair in a slotted-time system. The transmitter observes a dynamic source and sends updates to a remote receiver through an error-free communication channel that suffers a random delay. We consider two cases. In the first case, the update is guaranteed to be delivered within a certain number of time slots. In the second case, the update is immediately discarded once the transmission time exceeds a predetermined value. The receiver estimates the state of the dynamic source using the received updates. In this paper, we adopt the Age of Incorrect Information (AoII) as the performance metric and investigate the problem of optimizing the transmitter's action in each time slot to minimize the AoII. We first characterize the optimization problem as a Markov decision process and investigate the performance of the threshold policy, under which the transmitter transmits updates only when transmission is allowed and the AoII exceeds the threshold τ. By delving into the characteristics of the system evolution, we precisely compute the expected AoII achieved by the threshold policy using a Markov chain. We then prove that an optimal policy exists. Furthermore, by leveraging the policy improvement theorem, we theoretically prove that, under an easily verifiable condition, the optimal policy is the threshold policy with τ = 1. Finally, numerical results are presented to highlight the performance of the optimal policy.
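The threshold policy above can be illustrated with a toy Monte Carlo simulation. The model here is a deliberate simplification, not the paper's: a binary symmetric source and a unit-delay error-free channel, with AoII growing while the receiver's estimate is wrong and resetting to zero otherwise.

```python
import random

def avg_aoii(tau, p=0.2, slots=200_000, seed=0):
    """Toy simulation of the threshold policy 'transmit iff AoII >= tau'.
    Assumed model (much simpler than the paper's): a binary symmetric source
    flipping w.p. p per slot; a transmitted update installs the current
    source state at the receiver within the slot. Returns time-average AoII."""
    rng = random.Random(seed)
    src = est = aoii = 0
    total = 0
    for _ in range(slots):
        if aoii >= tau:          # threshold policy: transmit an update
            est = src            # error-free unit-delay delivery
        if rng.random() < p:     # source evolves
            src ^= 1
        aoii = 0 if est == src else aoii + 1
        total += aoii
    return total / slots
```

Under this toy model, lower thresholds trade more transmissions for lower AoII, so `avg_aoii(1)` comes out below `avg_aoii(5)`, loosely mirroring the paper's finding that τ = 1 is optimal when transmissions are unconstrained.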
Stochastic hybrid systems have received significant attention as a relevant modeling framework for describing many systems, from engineering to the life sciences: they enable the study of numerous applications, including transportation networks, biological systems and chemical reaction networks, smart energy and power grids, and beyond. Automated verification and policy synthesis for stochastic hybrid systems can be inherently challenging: this is due to the heterogeneity of their dynamics (presence of continuous and discrete components), the presence of uncertainty, and, in some applications, the large dimension of state and input sets. Over the past few years, a few hundred articles have investigated these models and developed diverse and powerful approaches to mitigate difficulties encountered in the analysis and synthesis of such complex stochastic systems. In this survey, we overview the most recent results in the literature and discuss different approaches, including (in)finite abstractions, verification and synthesis for temporal logic specifications, stochastic similarity relations, (control) barrier certificates, compositional techniques, and a selection of results on continuous-time stochastic systems; we finally survey recently developed software tools that implement the discussed approaches. Throughout the manuscript we discuss a few open topics to be considered as potential future research directions: we hope that this survey will guide younger researchers through a comprehensive understanding of the various challenges, tools, and solutions in this enticing and rich scientific area.
The effective operation of time-critical Internet of Things (IoT) applications requires real-time reporting of fresh status information of underlying physical processes. In this paper, a real-time IoT monitoring system is considered, in which the IoT devices sample a physical process with a sampling cost and send the status packet to a given destination with an updating cost. This joint status sampling and updating process is designed to minimize the average age of information (AoI) at the destination node under an average energy cost constraint at each device. This stochastic problem is formulated as an infinite-horizon average-cost constrained Markov decision process (CMDP) and transformed into an unconstrained Markov decision process (MDP) using a Lagrangian method. For the single-IoT-device case, the optimal policy for the CMDP is shown to be a randomized mixture of two deterministic policies for the unconstrained MDP, which is of threshold type. This reveals a fundamental tradeoff between the average AoI at the destination and the sampling and updating costs. Then, a structure-aware optimal algorithm to obtain the optimal policy of the CMDP is proposed, and the impact of the wireless channel dynamics is studied, demonstrating that channels with a larger mean channel gain and less scattering can achieve better AoI performance. For the case of multiple IoT devices, a low-complexity semi-distributed suboptimal policy is proposed, with the updating control at the destination and the sampling control at each IoT device. Then, an online learning algorithm is developed to obtain this policy, which can be implemented at each IoT device and requires only local knowledge and small signaling from the destination. The proposed learning algorithm is shown to converge almost surely to the suboptimal policy. Simulation results show the structural properties of the optimal policy for the single-IoT-device case, and show that the proposed policy for multiple IoT devices outperforms a zero-wait baseline policy, with average AoI reductions reaching up to 33%.
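The "randomized mixture of two deterministic policies" above has a simple arithmetic core, sketched below with hypothetical performance numbers (in the actual method the two policies come from solving the unconstrained MDP at the optimal Lagrange multiplier):

```python
def mix_policies(perf_low, perf_high, energy_budget):
    """Randomized mixture of two deterministic policies. Each perf_* is a
    pair (average AoI, average energy), with energies straddling the budget.
    Returns (probability of selecting the low-energy policy, resulting
    average AoI). Illustrative helper only: it shows why randomizing between
    two deterministic policies meets an average-energy constraint exactly."""
    aoi_lo, e_lo = perf_low
    aoi_hi, e_hi = perf_high
    assert e_lo <= energy_budget <= e_hi
    mu = (e_hi - energy_budget) / (e_hi - e_lo)   # weight on low-energy policy
    return mu, mu * aoi_lo + (1 - mu) * aoi_hi

# hypothetical numbers: a frugal policy (AoI 5, energy 1) and an aggressive
# one (AoI 2, energy 3), mixed to hit an energy budget of 2
mu, aoi = mix_policies((5.0, 1.0), (2.0, 3.0), 2.0)
# mu = 0.5, aoi = 3.5
```

The mixture saturates the energy constraint while interpolating the AoI, which is why the constrained optimum can be no worse than the better pure policy that satisfies the budget.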