This paper considers a timely information updating problem where an energy harvesting (EH) IoT receiver node interacts with an information source having a state-dependent, time-varying update generation rate. This model and problem are motivated by the interaction of a random or controlled state change (represented by lazy and prolific modes) in monitoring a physical process and the ability of the IoT node to monitor and track it in a timely fashion using harvested energy. Time is slotted, and in every time slot the EH IoT receiver node can either turn ON to receive status updates, if any, or turn OFF to save energy. With the aim of minimizing the average age of information (AoI) at the receiving end with available state information, we determine the optimal ON-OFF scheduling policy of the EH receiver for the single-unit-capacity (infinite-capacity) battery case through a Markov decision process (constrained Markov decision process) framework. We obtain the resulting dynamic programming algorithms that yield optimal ON-OFF scheduling policies. Furthermore, we consider an age-threshold-based scheme called the "state-adapted waiting before turning ON" scheduling policy and obtain closed-form expressions for the average AoI in the single-unit and infinite battery capacity cases. To study the effect of battery presence and optimal waiting time, we also consider the case of no battery and another policy that waits until a state transition occurs in the information source. We consistently observe in our numerical results that the average AoI of the state-adapted age-threshold-based ON-OFF scheme matches that of the optimal policy.
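The single-unit-battery case above can be sketched as a small MDP and solved by dynamic programming. The following is a hedged toy version, not the paper's algorithm: states are (age, battery, source mode), the ON action spends the battery unit to listen for an update that is generated with a mode-dependent probability, and discounted value iteration stands in for the average-cost formulation. All numeric parameters are illustrative.

```python
A_MAX = 10                                   # cap on the age value
P_UPDATE = {"lazy": 0.2, "prolific": 0.8}    # update generation prob per mode
P_SWITCH = 0.1                               # mode switch prob per slot
P_HARVEST = 0.5                              # energy arrival prob per slot
GAMMA = 0.95                                 # discount (proxy for avg cost)

STATES = [(a, b, m) for a in range(1, A_MAX + 1)
          for b in (0, 1) for m in ("lazy", "prolific")]

def transitions(state, action):
    """Yield (probability, next_state, stage_cost); the cost is the age."""
    age, bat, mode = state
    other = "prolific" if mode == "lazy" else "lazy"
    for new_mode, pm in ((mode, 1 - P_SWITCH), (other, P_SWITCH)):
        for harvest, ph in ((1, P_HARVEST), (0, 1 - P_HARVEST)):
            if action == "ON" and bat == 1:
                pu = P_UPDATE[mode]
                # battery unit is spent; the harvested unit (if any) remains
                yield (pm * ph * pu, (1, harvest, new_mode), age)
                yield (pm * ph * (1 - pu),
                       (min(age + 1, A_MAX), harvest, new_mode), age)
            else:  # OFF, or ON with an empty battery: age keeps growing
                nb = min(1, bat + harvest)
                yield (pm * ph, (min(age + 1, A_MAX), nb, new_mode), age)

def q_value(V, s, a):
    return sum(p * (c + GAMMA * V[ns]) for p, ns, c in transitions(s, a))

V = {s: 0.0 for s in STATES}
for _ in range(300):                         # value iteration sweeps
    V = {s: min(q_value(V, s, a) for a in ("ON", "OFF")) for s in STATES}

policy = {s: min(("ON", "OFF"), key=lambda a: q_value(V, s, a))
          for s in STATES}
```

The resulting policy exhibits the age-threshold flavor the paper studies: with energy in the battery and a large age, turning ON is optimal, while at small ages the receiver saves energy.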
Transfer reinforcement learning has gained significant traction in recent years as a critical research area, focusing on bolstering agents' decision-making prowess by harnessing insights from analogous tasks. The primary transfer learning method involves identifying the appropriate source domains, sharing specific knowledge structures, and subsequently transferring the shared knowledge to novel tasks. However, existing transfer methods exhibit a pronounced dependency on high task similarity and an abundance of source data. Consequently, we attempt to formulate a more efficacious approach that optimally exploits previous learning experiences to direct an agent's exploration as it learns new tasks. Specifically, we introduce a novel transfer learning paradigm rooted in a distance measure on the Markov chain, denoted Distance Measure Substructure Transfer Reinforcement Learning (DMS-TRL). The core idea is to partition the Markov chain into the most basic small Markov units, which contain basic information about the agent's transition between two states, and then to employ a new distance measure to find the most similar structure, which is also the most suitable for transfer. Finally, we propose a policy transfer method that transfers knowledge through the Q table from the selected Markov unit to the target task. Through a series of experiments conducted on discrete Gridworld scenarios, we compare our approach with state-of-the-art learning methods. The results clearly illustrate that DMS-TRL can adeptly identify the optimal policy in target tasks, exhibiting swifter convergence.
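The core idea can be illustrated in a few lines. This is a hedged sketch, not the paper's exact measure: each "Markov unit" is represented by a per-state transition distribution, source units are scored against the target unit with a distance (total variation here), and the target task's Q-table row is seeded from the closest unit.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def transfer_q_row(target_unit, source_units, source_q_rows):
    """Pick the source unit most similar to the target unit and return a
    copy of its Q-table row plus the chosen index."""
    dists = [tv_distance(target_unit, u) for u in source_units]
    best = int(np.argmin(dists))
    return np.array(source_q_rows[best], dtype=float), best
```

A full DMS-TRL run would repeat this selection per state and then fine-tune the transferred Q-table with ordinary Q-learning on the target task.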
In this letter, we study an unmanned aerial vehicle (UAV)-mounted mobile edge computing network, where the UAV executes computational tasks offloaded from mobile terminal users (TUs) and the motion of each TU follows a Gauss-Markov random model. To ensure the quality-of-service (QoS) of each TU, the UAV with limited energy dynamically plans its trajectory according to the locations of mobile TUs. Towards this end, we formulate the problem as a Markov decision process, wherein the UAV trajectory and UAV-TU association are modeled as the parameters to be optimized. To maximize the system reward and meet the QoS constraint, we develop a QoS-based action selection policy in the proposed algorithm based on double deep Q-network. Simulations show that the proposed algorithm converges more quickly and achieves a higher sum throughput than conventional algorithms.
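A QoS-based action selection rule of the kind the letter describes can be sketched as follows: act greedily on the Q-network's estimates, but only over UAV moves that keep every terminal user inside a coverage radius. The move set, radius, and 2-D geometry are illustrative assumptions, not the letter's model.

```python
import numpy as np

def select_action(q_values, uav_pos, tu_positions, moves, radius):
    """Return the index of the best QoS-feasible move (greedy fallback)."""
    uav_pos = np.asarray(uav_pos, dtype=float)
    feasible = [a for a, d in enumerate(moves)
                if all(np.linalg.norm(uav_pos + d - np.asarray(tu)) <= radius
                       for tu in tu_positions)]
    if not feasible:                       # no move satisfies QoS: greedy
        return int(np.argmax(q_values))
    return max(feasible, key=lambda a: q_values[a])
```

In the actual algorithm this rule would sit inside a double deep Q-network training loop; the feasibility mask only changes which action the behavior and target policies may pick.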
Multiple aircraft collision avoidance is a challenging problem due to a stochastic environment and uncertainty in the intent of other aircraft. Traditionally, a layered approach to collision avoidance has been employed using a centralized air traffic control system, established rules of the road, separation assurance, and last minute pairwise collision avoidance. With the advent of Urban Air Mobility (air taxis), the expected increase in traffic density in urban environments, short time scales, and small distances between aircraft favor decentralized decision making on-board the aircraft. In this paper, we present a Markov Decision Process (MDP) based method, named FastMDP, which can solve a certain subclass of MDPs quickly, and demonstrate using the algorithm online to safely maintain separation and avoid collisions with multiple aircraft (1-on-n) while remaining computationally efficient. We compare the FastMDP algorithm's performance against two online collision avoidance algorithms that have been shown to be both efficient and scale to large numbers of aircraft: Optimal Reciprocal Collision Avoidance (ORCA) and Monte Carlo Tree Search (MCTS). Our simulation results show that under the assumption that aircraft do not have perfect knowledge of other aircraft intent FastMDP outperforms ORCA and MCTS in collision avoidance behavior in terms of loss of separation and near mid-air collisions while being more computationally efficient. We further show that in our simulation FastMDP behaves nearly as well as MCTS with perfect knowledge of other aircraft intent. Our results show that FastMDP is a promising algorithm for collision avoidance that is also computationally efficient.
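The two safety metrics used in such comparisons, loss of separation (LoS) and near mid-air collision (NMAC), are counted whenever a pair of aircraft comes within a threshold distance. This hedged helper is not from the paper, and the threshold values below are illustrative.

```python
import math

LOS_RADIUS = 150.0    # metres (hypothetical threshold)
NMAC_RADIUS = 30.0    # metres (hypothetical threshold)

def count_events(trajectories):
    """trajectories: dict aircraft_id -> list of (x, y) per time step.
    Returns (LoS count, NMAC count) over all pairs and steps."""
    ids = sorted(trajectories)
    steps = min(len(t) for t in trajectories.values())
    los = nmac = 0
    for t in range(steps):
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = math.dist(trajectories[a][t], trajectories[b][t])
                los += d < LOS_RADIUS
                nmac += d < NMAC_RADIUS
    return los, nmac
```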
Update or Wait: How to Keep Your Data Fresh
Sun, Yin; Uysal-Biyikoglu, Elif; Yates, Roy D.
IEEE Transactions on Information Theory, vol. 63, no. 11, November 2017
Journal Article · Peer reviewed · Open access
In this paper, we study how to optimally manage the freshness of information updates sent from a source node to a destination via a channel. A proper metric for data freshness at the destination is the age-of-information, or simply age, which is defined as how old the freshest received update is, since the moment that this update was generated at the source node (e.g., a sensor). A reasonable update policy is the zero-wait policy, i.e., the source node submits a fresh update once the previous update is delivered, which achieves the maximum throughput and the minimum delay. Surprisingly, this zero-wait policy does not always minimize the age. This counter-intuitive phenomenon motivates us to study how to optimally control information updates to keep the data fresh and to understand when the zero-wait policy is optimal. We introduce a general age penalty function to characterize the level of dissatisfaction with data staleness and formulate the average age penalty minimization problem as a constrained semi-Markov decision problem with an uncountable state space. We develop efficient algorithms to find the optimal update policy among all causal policies and establish sufficient and necessary conditions for the optimality of the zero-wait policy. Our investigation shows that the zero-wait policy is far from the optimum if: 1) the age penalty function grows quickly with respect to the age; 2) the packet transmission times over the channel are positively correlated over time; or 3) the packet transmission times are highly random (e.g., following a heavy-tail distribution).
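The counter-intuitive finding is easy to reproduce numerically. The following hedged sketch uses an illustrative two-point transmission-time distribution (not one from the paper): waiting a constant time after each delivery beats the zero-wait policy when transmission times are highly random.

```python
import random

def avg_age(wait, n=200_000, seed=1):
    """Time-average age when the source waits `wait` after each delivery.
    Transmission time Y is 0 w.p. 0.9 and 10 w.p. 0.1 (E[Y] = 1)."""
    rng = random.Random(seed)
    t = 0.0     # current time (last delivery instant)
    g = 0.0     # generation time of the latest delivered update
    area = 0.0  # integral of the age curve
    for _ in range(n):
        s = t + wait                       # next update generated
        y = 0.0 if rng.random() < 0.9 else 10.0
        d = s + y                          # next update delivered
        a0, a1 = t - g, d - g              # age just after t, just before d
        area += (a1 * a1 - a0 * a0) / 2    # trapezoid under the age curve
        t, g = d, s
    return area / t
```

By renewal-reward reasoning, the average age under a constant wait w for this distribution is (w² + 4w + 12) / (2(1 + w)): 6 under zero wait, but 4 at the minimizing wait w = 2. The simulation reproduces both numbers.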
The intermittent nature of renewable energy resources such as wind and solar causes the energy supply to be less predictable, leading to possible mismatches in the power network. To this end, hydrogen production and storage can provide a solution by increasing flexibility within the system. Stored hydrogen as compressed gas can either be converted back to electricity or it can be used as feed-stock for industry, heating for the built environment, and fuel for vehicles. This research is the first to examine optimal strategies for operating integrated energy systems consisting of renewable energy production and hydrogen storage with direct gas-based use-cases for hydrogen. Using Markov decision process theory, we construct optimal policies for day-to-day decisions on how much energy to store as hydrogen, or buy from or sell to the electricity market, and on how much hydrogen to sell for use as gas. We place special emphasis on practical settings, such as contractually binding power purchase agreements, varying electricity prices, different distribution channels, green hydrogen offtake agreements, and hydrogen market price uncertainties. Extensive experiments and analysis are performed in the context of the Northern Netherlands, where Europe's first Hydrogen Valley is being formed. Results show that gains in operational revenues of up to 51% are possible by introducing hydrogen storage units and competitive hydrogen market prices. This amounts to a €126,000 increase in revenues per turbine per year for a 4.5 MW wind turbine. Moreover, our results indicate that hydrogen offtake agreements will be crucial in keeping the energy transition on track.
•We integrate green hydrogen production with the electricity and the hydrogen market.
•We consider the profit-maximizing behavior of green hydrogen energy system operators.
•We provide optimal state-dependent solutions via Markov decision process theory.
•Including green hydrogen storage can increase operational revenues significantly.
•Hydrogen offtake agreements will be crucial to keep the energy transition on track.
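The day-to-day decision structure can be sketched as a finite-horizon dynamic program. This is a hedged toy version, not the paper's calibrated model: the state is (hydrogen stock, electricity price regime), each day one unit of surplus energy arrives and can be sold as electricity or converted to hydrogen, and at most one stored unit can be sold as gas per day. All prices and probabilities are invented for illustration.

```python
CAP = 4                                   # storage capacity (units)
HORIZON = 30                              # days
ELEC_PRICE = {"low": 20.0, "high": 60.0}  # EUR per unit of electricity
H2_PRICE = 45.0                           # EUR per unit of hydrogen gas
P_STAY = 0.7                              # price-regime persistence

def solve():
    V = {(s, p): 0.0 for s in range(CAP + 1) for p in ELEC_PRICE}
    policy = {}
    for _ in range(HORIZON):              # backward induction over days
        newV = {}
        for (s, p) in V:
            other = "high" if p == "low" else "low"
            best = best_a = None
            for store in (0, 1):          # store today's surplus?
                for sell_h2 in (0, 1):    # sell one stored unit as gas?
                    if store and s == CAP:
                        continue
                    if sell_h2 and s + store == 0:
                        continue
                    ns = s + store - sell_h2
                    r = (0.0 if store else ELEC_PRICE[p]) \
                        + sell_h2 * H2_PRICE
                    ev = (P_STAY * V[(ns, p)]
                          + (1 - P_STAY) * V[(ns, other)])
                    if best is None or r + ev > best:
                        best, best_a = r + ev, (store, sell_h2)
            newV[(s, p)] = best
            policy[(s, p)] = best_a
        V = newV
    return V, policy

V, policy = solve()
```

The optimal policy is state-dependent in the way the highlights describe: when the electricity price is high the surplus is sold to the market, while stored hydrogen carries option value that grows with the stock.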
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
Selfish mining attacks are attractive because they yield additional rewards disproportionate to the attacker's mining power (mining pools hold particular advantages). Generally, work on this category of attacks focuses on decreasing the profitability threshold, from the attacker's point of view, to maximize rewards. Semi-selfish mining falls into the family of selfish mining attacks, with a threshold value of approximately 15%. However, little attention has been paid to implementing these attacks in practice. In this paper, we focus on the validity of semi-selfish mining attacks considering the probability of being detected. More specifically, we discuss mining strategies through backward deduction. That is, we derive the attacking states from the observable states that exhibit a normal forking rate, so that, from the honest miners' point of view, the chain looks as it would without semi-selfish mining attacks. Rewards distribution is further investigated under these strategies. The simulation results indicate that semi-selfish mining does not necessarily bring a reward advantage to large pools; instead, small pools gain the additional rewards. However, the probability that a small pool can successfully implement these strategies is quite low. That is, although profitable, it is practically impossible for pools to sponsor semi-selfish mining attacks without being detected.
This paper shows that the optimal policy and value functions of a Markov Decision Process (MDP), either discounted or not, can be captured by a finite-horizon undiscounted Optimal Control Problem (OCP), even if based on an inexact model. This can be achieved by selecting a proper stage cost and terminal cost for the OCP. A very useful particular case of OCP is a Model Predictive Control (MPC) scheme where a deterministic (possibly nonlinear) model is used to reduce the computational complexity. This observation leads us to parameterize an MPC scheme fully, including the cost function. In practice, Reinforcement Learning algorithms can then be used to tune the parameterized MPC scheme. We verify the developed theorems analytically in an LQR case and we investigate some other nonlinear examples in simulations.
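The LQR special case can be checked numerically. This is a hedged sketch with arbitrary system matrices (a discretized double integrator), not the paper's example: when the terminal cost of a finite-horizon OCP is the infinite-horizon value function x^T P x (the Riccati fixed point), the first-stage feedback equals the infinite-horizon optimal gain for any horizon length.

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

def riccati_step(P):
    """One backward Riccati recursion; returns (new P, gain K)."""
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ (A - B @ K), K

def fixed_point(iters=2000):
    """Iterate the recursion to the infinite-horizon solution."""
    P = np.eye(2)
    for _ in range(iters):
        P, _ = riccati_step(P)
    return P

def first_stage_gain(P_terminal, horizon):
    """First-stage feedback of a finite-horizon OCP with terminal cost."""
    P = P_terminal
    for _ in range(horizon):
        P, K = riccati_step(P)
    return K
```

Tuning a parameterized MPC with RL then amounts to adjusting the stage and terminal costs so that this match is recovered even when A and B are inexact.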
•Dynamic condition based mission abort policies for systems subject to degradation are developed.
•The structural properties of the optimal abort policies are investigated.
•A detailed comparison between the optimal policy and several heuristic policies is conducted.
•Mission reliability and system survivability are derived under the proposed heuristic policies.
Safety-critical systems are commonly required to perform missions in various engineering fields. Failures of safety-critical systems may result in irretrievable economic losses and significant damage. To enhance system survivability, a mission abort is usually conducted if the failure risk becomes too high. This paper investigates the joint optimization of inspection and condition based mission abort policies for systems subject to continuous degradation. Dynamic mission abort decisions are considered based on the degradation level together with the time in mission. The problem is formulated within the framework of a Markov decision process to minimize the expected costs of inspection, mission failure, and system failure. In addition to deriving some structural properties, we also numerically evaluate several heuristic policies for which mission reliability and system survivability are derived. Numerical studies are presented to validate the obtained results.
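The abort decision structure can be sketched as a finite-horizon dynamic program. This is a hedged toy instance with a discrete degradation level, not the paper's continuous-degradation model: continuing risks the large system-failure cost if degradation crosses the threshold before the mission ends, while aborting incurs the smaller mission-failure cost. All numbers are illustrative.

```python
FAIL_LEVEL = 10
MISSION_LEN = 20
C_ABORT = 1.0                       # cost of aborting the mission
C_SYS_FAIL = 10.0                   # cost of losing the system
P_DEG = {0: 0.9, 1: 0.08, 2: 0.02}  # per-step degradation increment pmf

def solve_abort():
    size = FAIL_LEVEL + 3           # room for overshoot past the threshold
    V = [[0.0] * size for _ in range(MISSION_LEN + 1)]
    for x in range(FAIL_LEVEL, size):
        V[MISSION_LEN][x] = C_SYS_FAIL
    policy = [[None] * size for _ in range(MISSION_LEN)]
    for t in reversed(range(MISSION_LEN)):
        for x in range(size):
            if x >= FAIL_LEVEL:     # system already failed
                V[t][x] = C_SYS_FAIL
                continue
            cont = sum(p * V[t + 1][x + d] for d, p in P_DEG.items())
            V[t][x] = min(C_ABORT, cont)
            policy[t][x] = "abort" if C_ABORT < cont else "continue"
    return V, policy

V, policy = solve_abort()
```

The computed policy has the control-limit structure the paper's analysis is after: for each time in mission there is a degradation threshold above which aborting is optimal.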
In a system consisting of multiple functionally exchangeable units, differences in units' degradation levels can significantly affect the system's performance. Dynamic reallocation of these units can improve performance and prolong the lifetime of the system. This study is the first to quantify the benefit of incorporating reallocation into a condition-based maintenance framework for a 1-out-of-2 pairs balanced system, i.e., a system with two pairs of units that functions if there is at least one functioning pair. The balance condition requires that the two units in the same pair be active or inactive at the same time. Unit degradation is modeled as a Gamma process and is inspected periodically. A Markov decision process model is developed to determine the optimal integrated reallocation and maintenance policy that minimizes the long-run average cost per unit time. A numerical study illustrates that the proposed integrated reallocation and maintenance policy significantly outperforms more limited policies, including reallocate-only and maintain-only policies.
•Joint optimization of reallocation and maintenance for balanced systems.
•Both the reallocation and maintenance decisions are condition-based.
•The system is modeled as a Markov Decision Process.
•Insightful graphical presentation of the optimal policy.
•The integration leads to cost savings of up to 10%.
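The reallocation step in isolation can be sketched as follows. This is a hedged illustration, not the paper's MDP: at an inspection epoch there are exactly three ways to pair four exchangeable units, and a simple proxy objective, balancing total degradation across the two pairs, stands in for the cost-to-go comparison the MDP would perform.

```python
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def best_pairing(levels):
    """levels: degradation levels of units 0..3; returns the pairing that
    minimizes the imbalance in summed degradation between the pairs."""
    def imbalance(p):
        (i, j), (k, l) = p
        return abs((levels[i] + levels[j]) - (levels[k] + levels[l]))
    return min(PAIRINGS, key=imbalance)
```

The paper's model evaluates each pairing by its long-run average cost rather than this heuristic imbalance; the sketch only shows the combinatorial shape of the decision.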