This paper considers a timely information updating problem where an energy harvesting (EH) IoT receiver node interacts with an information source having a state-dependent, time-varying update generation rate. The model and problem are motivated by the interaction between random or controlled state changes (represented by lazy and prolific modes) in a monitored physical process and the ability of the IoT node to track that process in a timely fashion using harvested energy. Time is slotted, and in every time slot the EH IoT receiver node can either turn ON to receive status updates, if any, or turn OFF to save energy. With the aim of minimizing the average age of information (AoI) at the receiving end with available state information, we determine the optimal ON-OFF scheduling policy of the EH receiver for the single-unit-capacity battery case through a Markov decision process framework, and for the infinite-capacity battery case through a constrained Markov decision process framework. We obtain dynamic programming algorithms that yield the optimal ON-OFF scheduling policies. Furthermore, we consider an age-threshold-based scheme, the "state-adapted waiting before turning ON" scheduling policy, and obtain closed-form expressions of the average AoI for the single-unit and infinite battery capacity cases. To study the effect of battery presence and optimal waiting time, we also consider the no-battery case and a policy that waits until a state transition occurs in the information source. Our numerical results consistently show that the average AoI of the state-adapted age-threshold-based ON-OFF scheme matches that of the optimal policy.
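The kind of ON-OFF scheduling MDP the abstract describes can be illustrated with a toy value iteration over (age, battery) states. All numbers below (update probability, harvest probability, age cap, discount factor) are invented for illustration, and a discounted criterion stands in for the paper's average-AoI objective:

```python
# Toy value-iteration sketch of a single-unit-battery ON-OFF MDP.
# Parameters are illustrative assumptions, not values from the paper.
P_UPDATE, P_HARVEST = 0.6, 0.3   # source update / energy-arrival probabilities
A_MAX, GAMMA = 20, 0.95          # age truncation and discount factor

V = {(a, b): 0.0 for a in range(1, A_MAX + 1) for b in (0, 1)}

def q_value(a, b, on, V):
    """Expected discounted AoI cost of action ON/OFF in state (age, battery)."""
    a_next = min(a + 1, A_MAX)
    if on and b == 1:   # spend the energy unit; an update arrives w.p. P_UPDATE
        future = P_UPDATE * V[(1, 0)] + (1 - P_UPDATE) * V[(a_next, 0)]
    else:               # stay OFF: age grows, the battery may recharge
        future = P_HARVEST * V[(a_next, 1)] + (1 - P_HARVEST) * V[(a_next, b)]
    return a + GAMMA * future    # instantaneous cost = current age

for _ in range(500):             # value iteration to near-convergence
    V = {s: min(q_value(*s, on, V) for on in (False, True)) for s in V}

# Greedy policy: with a charged battery, turn ON once the age is large enough
policy = {s: min((False, True), key=lambda on: q_value(*s, on, V)) for s in V}
```

In this toy instance the greedy policy with a charged battery is of age-threshold type (OFF below some age, ON above it), which mirrors the age-threshold scheme the paper analyzes.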
In this letter, we study an unmanned aerial vehicle (UAV)-mounted mobile edge computing network, where the UAV executes computational tasks offloaded from mobile terminal users (TUs) and the motion of each TU follows a Gauss-Markov random model. To ensure the quality of service (QoS) of each TU, the UAV, which has limited energy, dynamically plans its trajectory according to the locations of the mobile TUs. To this end, we formulate the problem as a Markov decision process in which the UAV trajectory and the UAV-TU association are the parameters to be optimized. To maximize the system reward while meeting the QoS constraint, we develop a QoS-based action selection policy in the proposed algorithm, which is built on a double deep Q-network. Simulations show that the proposed algorithm converges more quickly and achieves a higher sum throughput than conventional algorithms.
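One plausible reading of a QoS-based action selection rule of this kind is to mask out, before the usual epsilon-greedy step of a (double) DQN, any UAV-TU action whose predicted throughput would violate the QoS requirement. The Q-values, throughput estimates, and threshold below are made-up illustrative numbers, not the letter's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def qos_epsilon_greedy(q_values, est_throughput, qos_min, eps=0.1):
    """Pick an action index: epsilon-greedy restricted to QoS-feasible actions."""
    feasible = np.flatnonzero(est_throughput >= qos_min)
    if feasible.size == 0:        # no action meets QoS: fall back to best effort
        feasible = np.arange(len(q_values))
    if rng.random() < eps:        # explore, but only among feasible actions
        return int(rng.choice(feasible))
    return int(feasible[np.argmax(q_values[feasible])])

q = np.array([2.0, 3.5, 1.2, 3.9])    # Q(s, a) for 4 candidate UAV-TU actions
thr = np.array([5.0, 1.0, 6.0, 2.0])  # predicted throughput per action (Mbps)
a = qos_epsilon_greedy(q, thr, qos_min=4.0, eps=0.0)  # greedy, QoS-feasible pick
```

Note that the globally best Q-value (action 3) is skipped here because its predicted throughput fails the QoS check; the rule selects the best action among the feasible ones instead.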
Transfer reinforcement learning has gained significant traction in recent years as a critical research area, focusing on bolstering agents' decision-making prowess by harnessing insights from analogous tasks. The primary transfer learning method involves identifying appropriate source domains, sharing specific knowledge structures, and subsequently transferring the shared knowledge to novel tasks. However, existing transfer methods exhibit a pronounced dependency on high task similarity and an abundance of source data. Consequently, we attempt to formulate a more efficacious approach that optimally exploits previous learning experiences to direct an agent's exploration as it learns new tasks. Specifically, we introduce a novel transfer learning paradigm rooted in a distance measure on the Markov chain, denoted Distance Measure Substructure Transfer Reinforcement Learning (DMS-TRL). The core idea is to partition the Markov chain into the most basic small Markov units, which contain basic information about the agent's transitions between two states, and then to employ a new distance measure to find the most similar structure, which is also the most suitable for transfer. Finally, we propose a policy transfer method that transfers knowledge through the Q-table from the selected Markov unit to the target task. Through a series of experiments conducted on discrete Gridworld scenarios, we compare our approach with state-of-the-art learning methods. The results clearly illustrate that DMS-TRL can adeptly identify the optimal policy in target tasks, exhibiting swifter convergence.
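As a loose sketch of this idea, one can take a "Markov unit" to be a (state, next-state) pair summarized by its transition probability and mean reward, measure similarity with a simple Euclidean distance over those two numbers, and warm-start the target task with the Q-table slice of the closest source unit. The exact unit definition and distance measure in DMS-TRL may differ, and the data below are invented for illustration:

```python
import math

def unit_distance(u, v):
    """Distance between two Markov units summarized as (p_transition, mean_reward)."""
    return math.hypot(u[0] - v[0], u[1] - v[1])

# Source-task Markov units with their learned Q-values, keyed by (s, s')
source_units = {
    ("s0", "s1"): {"summary": (0.9, 1.0), "q": {"right": 0.8, "left": 0.1}},
    ("s1", "s2"): {"summary": (0.5, 0.2), "q": {"right": 0.3, "left": 0.4}},
}

target_summary = (0.85, 0.9)  # statistics observed for a unit of the new task

# Find the most similar source unit and transfer its Q-table slice
best_key = min(source_units,
               key=lambda k: unit_distance(source_units[k]["summary"], target_summary))
transferred_q = dict(source_units[best_key]["q"])  # warm-start for the target task
```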
Multiple-aircraft collision avoidance is a challenging problem due to a stochastic environment and uncertainty in the intent of other aircraft. Traditionally, a layered approach to collision avoidance has been employed, using a centralized air traffic control system, established rules of the road, separation assurance, and last-minute pairwise collision avoidance. With the advent of Urban Air Mobility (air taxis), the expected increase in traffic density in urban environments, short time scales, and small distances between aircraft favor decentralized decision making on board the aircraft. In this paper, we present a Markov decision process (MDP) based method, named FastMDP, which can solve a certain subclass of MDPs quickly, and we demonstrate using the algorithm online to safely maintain separation and avoid collisions with multiple aircraft (1-on-n) while remaining computationally efficient. We compare the FastMDP algorithm's performance against two online collision avoidance algorithms that have been shown to be both efficient and scalable to large numbers of aircraft: Optimal Reciprocal Collision Avoidance (ORCA) and Monte Carlo Tree Search (MCTS). Our simulation results show that, under the assumption that aircraft do not have perfect knowledge of other aircraft's intent, FastMDP outperforms ORCA and MCTS in collision avoidance behavior in terms of loss of separation and near mid-air collisions while being more computationally efficient. We further show that in our simulation FastMDP behaves nearly as well as MCTS with perfect knowledge of other aircraft's intent. These results suggest that FastMDP is a promising and computationally efficient algorithm for collision avoidance.
The intermittent nature of renewable energy resources such as wind and solar makes the energy supply less predictable, leading to possible mismatches in the power network. To this end, hydrogen production and storage can provide a solution by increasing flexibility within the system. Hydrogen stored as compressed gas can either be converted back to electricity or be used as feedstock for industry, heating for the built environment, and fuel for vehicles. This research is the first to examine optimal strategies for operating integrated energy systems consisting of renewable energy production and hydrogen storage with direct gas-based use cases for hydrogen. Using Markov decision process theory, we construct optimal policies for day-to-day decisions on how much energy to store as hydrogen or buy from or sell to the electricity market, and on how much hydrogen to sell for use as gas. We pay special attention to practical settings, such as contractually binding power purchase agreements, varying electricity prices, different distribution channels, green hydrogen offtake agreements, and hydrogen market price uncertainty. Extensive experiments and analyses are performed in the context of the Northern Netherlands, where Europe's first Hydrogen Valley is being formed. Results show that gains in operational revenues of up to 51% are possible by introducing hydrogen storage units and competitive hydrogen market prices. This amounts to a €126,000 increase in revenue per turbine per year for a 4.5 MW wind turbine. Moreover, our results indicate that hydrogen offtake agreements will be crucial to keeping the energy transition on track.
•We integrate green hydrogen production with the electricity and the hydrogen market.
•We consider the profit-maximizing behavior of green hydrogen energy system operators.
•We provide optimal state-dependent solutions via Markov decision process theory.
•Including green hydrogen storage can increase operational revenues significantly.
•Hydrogen offtake agreements will be crucial to keep the energy transition on track.
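The day-to-day decision structure described above can be sketched as a small finite-horizon backward induction: each day the operator chooses how much of the wind production to convert to hydrogen and how much stored hydrogen to sell as gas, the remainder being sold as electricity. The capacities and price paths below are invented toy numbers, not calibrated to the Hydrogen Valley case study, and the sketch omits buying electricity and reconversion:

```python
# Toy backward induction for a wind + hydrogen-storage operating problem.
H_CAP, T, E_DAY = 3, 5, 2                 # storage cap, horizon (days), daily energy
ELEC_PRICE = [1.0, 0.2, 1.2, 0.3, 0.8]    # per-unit electricity price per day (toy)
H2_PRICE = 0.9                            # per-unit hydrogen (gas) price, flat (toy)

V = [[0.0] * (H_CAP + 1) for _ in range(T + 1)]   # V[t][h]: revenue-to-go
for t in reversed(range(T)):
    for h in range(H_CAP + 1):
        best = float("-inf")
        for store in range(min(E_DAY, H_CAP - h) + 1):   # energy -> hydrogen
            for sell in range(h + 1):                    # hydrogen -> gas market
                revenue = (E_DAY - store) * ELEC_PRICE[t] + sell * H2_PRICE
                best = max(best, revenue + V[t + 1][h + store - sell])
        V[t][h] = best
```

In this toy instance the optimal plan stores energy on the cheap-electricity days (prices 0.2 and 0.3, both below the 0.9 hydrogen price) and sells it later as gas, which is exactly the flexibility argument the abstract makes.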
Selfish mining attacks earn a high prize through additional rewards disproportionate to the attacker's mining power (mining pools have particular advantages). Generally, this category of attacks focuses on decreasing the threshold needed to maximize rewards from the attacker's point of view. Semi-selfish mining falls into the family of selfish mining attacks, with a threshold value of approximately 15%. However, little attention has been paid to implementing these attacks in practice. In this paper, we focus on the validity of semi-selfish mining attacks when the probability of being detected is taken into account. More specifically, we discuss mining strategies through backward deduction: the attacking states are derived from the observable states, which exhibit a normal forking rate, just as without semi-selfish mining attacks, from the honest miners' point of view. The reward distribution is further investigated under these strategies. The simulation results indicate that semi-selfish mining does not necessarily bring a reward advantage to large pools; instead, small pools gain an advantage from the additional rewards. However, the probability that small pools successfully implement these strategies is quite low. That is, although profitable, it is practically impossible for pools to sponsor semi-selfish mining attacks without being detected.
•Dynamic condition-based mission abort policies for systems subject to degradation are developed.
•The structural properties of the optimal abort policies are investigated.
•A detailed comparison between the optimal policy and several heuristic policies is conducted.
•Mission reliability and system survivability are derived under the proposed heuristic policies.
Safety-critical systems are commonly required to perform missions in various engineering fields. Failures of safety-critical systems may result in irretrievable economic losses and significant damage. To enhance system survivability, a mission is usually aborted if the failure risk becomes too high. This paper investigates the joint optimization of inspection and condition-based mission abort policies for systems subject to continuous degradation. Dynamic mission abort decisions are made based on the degradation level together with the time in the mission. The problem is formulated within the framework of a Markov decision process to minimize the expected costs of inspection, mission failure, and system failure. In addition to deriving some structural properties, we also numerically evaluate several heuristic policies for which mission reliability and system survivability are derived. Numerical studies are presented to validate the obtained results.
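A discretized toy version of this abort decision can be sketched by backward induction: at each inspection epoch the observed degradation level leads to either "continue" or "abort", trading the cheap abort cost against the expensive combined mission-plus-system failure cost. The degradation dynamics and the cost values below are hypothetical illustration numbers, not the paper's continuous-degradation model:

```python
# Toy backward induction for a condition-based mission abort policy.
LEVELS, T = 6, 4                  # degradation levels 0..5 (level 5 = failed)
P_STAY, P_UP = 0.6, 0.4           # per-epoch chance of staying / worsening by one
C_ABORT = 2.0                     # cost of aborting (mission lost, system saved)
C_FAIL = 25.0                     # combined mission + system failure cost

V = [0.0] * (LEVELS - 1) + [C_FAIL]   # terminal: mission done unless failed
abort_set = []                        # levels where "abort" is optimal, per epoch
for _ in range(T):                    # backward induction over inspection epochs
    new_V, aborts = [], []
    for d in range(LEVELS - 1):
        cont = P_STAY * V[d] + P_UP * V[d + 1]   # expected cost of continuing
        new_V.append(min(cont, C_ABORT))
        if C_ABORT < cont:
            aborts.append(d)
    new_V.append(C_FAIL)                         # failure is absorbing
    V, abort_set = new_V, abort_set + [aborts]
```

In this toy instance the optimal action is of control-limit type: abort only at the highest non-failed degradation level, at every remaining epoch, which is consistent with the kind of structural property the paper investigates.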
This work explores reinforcement learning (RL) for on-board planning and scheduling of an agile Earth-observing satellite (AEOS). In this formulation of the AEOS scheduling problem, a spacecraft in low-Earth orbit attempts to maximize the weighted sum of targets collected and downlinked. Reinforcement learning is both a class of problems and a class of solution methods that involve learning how to map situations to actions so as to maximize a reward function through repeated interactions with an environment. Reinforcement learning problems are formulated as Markov decision processes (MDPs), which are formalizations of sequential decision-making problems. In this work, the agile EOS scheduling problem is formulated as an MDP in which the number of upcoming imaging targets included in the action space is an adjustable parameter, to account for clusters of imaging targets with varying priorities. Unlike prior Earth-observing satellite scheduling MDP formulations, this work explores how the size of the action space can be reduced to produce generalized policies that may be executed on board the spacecraft in seconds without sacrificing performance. Monte Carlo tree search (MCTS) and supervised learning are used to train a set of agents with varying numbers of targets in the action space. Monte Carlo tree search is an online search algorithm that was originally developed to solve two-player games but has since been applied to reinforcement learning problems: it simulates interactions with the environment, building an estimate of the state-action value function that is used to select the next best action. In this work, MCTS is used to generate training data, and supervised learning is applied to the state-action value estimates generated by MCTS to solve for a generalized policy, which is used on board the spacecraft to map states to actions.
Two backup strategies are explored for MCTS: an incremental averaging operator and a maximization operator. For both backup operators, performance increases asymptotically as the number of targets in the action space approaches the maximum number of available targets. A benchmark is computed with MCTS to determine an upper bound on performance. Furthermore, MCTS is compared to solutions generated by a genetic algorithm. For all numbers of imaging targets in the action space, MCTS demonstrates a 2-5% increase in average reward at 10-20% of the single-core wall-clock time of the genetic algorithm. A search over various neural network hyperparameters is presented, and the trained neural networks are shown to approximate the MCTS policy with three orders of magnitude less execution time. Finally, the trained agents and the genetic algorithm are deployed on varying target densities for comparison and to demonstrate robustness to mission profiles outside the training distribution.
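The two backup operators can be sketched as updates to a node's running state-action value estimate after each simulated rollout. The rollout returns below are invented nonnegative numbers; in the scheduling problem they would be the accumulated reward of a simulated imaging and downlink sequence:

```python
def backup_average(q, n, ret):
    """Incremental averaging backup: Q <- Q + (ret - Q) / (n + 1)."""
    n += 1
    return q + (ret - q) / n, n

def backup_max(q, n, ret):
    """Maximization backup: keep the best rollout return seen through this node."""
    return max(q, ret), n + 1

returns = [4.0, 10.0, 6.0]        # toy rollout returns (assumed nonnegative,
q_avg = q_max = 0.0               # so initializing the max estimate at 0 is safe)
n_avg = n_max = 0
for r in returns:
    q_avg, n_avg = backup_average(q_avg, n_avg, r)
    q_max, n_max = backup_max(q_max, n_max, r)
# q_avg converges to the mean return; q_max tracks the best return observed
```

The averaging operator estimates the expected return of the tree policy, while the maximization operator is more optimistic, propagating the best trajectory found so far; this is the trade-off behind comparing the two backups.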