Offloading cellular traffic via Device-to-Device communication (D2D offloading) has been shown to be an effective way to ease the traffic burden of cellular networks. However, mobile nodes may be unwilling to take part in D2D offloading without proper financial incentives, since the data offloading process consumes considerable node resources. It is therefore essential to design effective incentive mechanisms that motivate nodes to participate in D2D offloading. Furthermore, the design of the content caching strategy is also crucial to the performance of D2D offloading. Considering these issues, this paper proposes a novel Incentive-driven and Deep Q Network (DQN) based Method, named IDQNM, in which a reverse auction serves as the incentive mechanism. The incentive-driven D2D offloading and content caching process is then modeled as an Integer Non-Linear Program (INLP) that aims to maximize the cost saving of the Content Service Provider (CSP). To solve this optimization problem, a content caching method based on a Deep Reinforcement Learning (DRL) algorithm, DQN, is proposed to obtain an approximately optimal solution, and a standard Vickrey-Clarke-Groves (VCG) based payment rule is proposed to compensate mobile nodes for their costs. Extensive real-trace-driven simulation results demonstrate that the proposed IDQNM greatly outperforms baseline methods in terms of the CSP's cost saving and the offloading rate in different scenarios.
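The abstract invokes a standard VCG-based payment rule without detail; as background, here is a minimal sketch of how such a rule computes compensation in a reverse auction. The `select` allocator stands in for the paper's DQN-based solver, and all names and numbers are illustrative assumptions, not details from the paper:

```python
def social_value(selected, values):
    """Total value the CSP derives from a set of selected helper nodes."""
    return sum(values[i] for i in selected)

def vcg_payments(nodes, values, select):
    """VCG-style rule: pay each winner the externality it imposes on the
    others, which makes truthful reporting a dominant strategy."""
    winners = select(nodes)
    payments = {}
    for i in winners:
        # Best achievable welfare of the others if node i did not exist.
        others_best = social_value(select([n for n in nodes if n != i]), values)
        # Welfare of the others in the chosen allocation (i's own value removed).
        others_actual = social_value(winners, values) - values[i]
        payments[i] = others_best - others_actual
    return payments

# Example: the allocator simply picks the two highest-value nodes.
values = {"n1": 5.0, "n2": 3.0, "n3": 2.0}
select = lambda ns: sorted(ns, key=lambda n: values[n], reverse=True)[:2]
print(vcg_payments(["n1", "n2", "n3"], values, select))  # {'n1': 2.0, 'n2': 2.0}
```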
This paper presents a novel composite obstacle avoidance control method that generates safe motion trajectories for autonomous systems in an adaptive manner. First, system safety is described using forward invariance, and a barrier function is encoded into the cost function so that the obstacle avoidance problem can be characterized as an infinite-horizon optimal control problem. Next, a safe reinforcement learning framework is proposed by combining model-based policy iteration and state-following-based approximation. Using real-time data and extrapolated experience data, this learning design is implemented through an actor-critic structure, in which critic networks are tuned by gradient-descent adaptation and actor networks produce adaptive control policies via gradient projection. Then, system stability and weight convergence are analyzed theoretically using the Lyapunov method. Finally, the proposed learning-based controller is demonstrated on a two-dimensional single-integrator system and a nonlinear unicycle kinematic system. Simulation results reveal that the agent smoothly reaches the target point while keeping a safe distance from each obstacle; at the same time, three other avoidance control methods are used to provide side-by-side comparisons and to verify the claimed advantages of the present method. Note to Practitioners: This paper is motivated by the obstacle avoidance problem in real-time navigation of an agent to a target point, which applies to practical autonomous systems such as vehicles and robots. Pre-generative methods and reactive methods have been widely employed to generate safe motion trajectories in obstacle environments. However, these methods cannot strike a good balance between safety and optimality. In this paper, the obstacle avoidance problem is formulated in the sense of optimal control, and a safe reinforcement learning method is designed to generate safe motion trajectories. This method combines the advantages of model-based policy iteration and state-following-based approximation, in which the former ensures regional optimality while the latter ensures local safety. Based on the proposed adaptive tuning laws, engineers are able to design learning-based avoidance controllers in environments with static obstacles. In future research, we will address the dynamic avoidance problem against moving obstacles.
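The core idea of encoding a barrier function into the cost is concrete enough to illustrate. The sketch below uses a reciprocal barrier and a quadratic stage cost; both are common choices assumed here for illustration, not the paper's exact formulation:

```python
import numpy as np

def barrier(x, center, radius):
    """Reciprocal barrier over the safe set h(x) = ||x - c||^2 - r^2 > 0;
    it grows without bound as the state approaches the obstacle boundary."""
    h = np.dot(x - center, x - center) - radius ** 2
    return 1.0 / h if h > 0 else np.inf

def running_cost(x, u, target, center, radius, q=1.0, r=0.1, mu=0.05):
    """Stage cost: quadratic tracking + control effort + barrier penalty.
    Minimizing its infinite-horizon integral trades off reaching the
    target against staying away from the obstacle."""
    tracking = q * np.dot(x - target, x - target)
    effort = r * np.dot(u, u)
    return tracking + effort + mu * barrier(x, center, radius)
```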
Glaucoma is the leading cause of irreversible but preventable blindness worldwide, and visual field testing is an important tool for its diagnosis and monitoring. Testing with standard visual field thresholding procedures is time-consuming, and prolonged test duration leads to patient fatigue and decreased test reliability. Different visual field testing algorithms have been developed to shorten testing time while maintaining accuracy. However, the performance of these algorithms depends heavily on prior knowledge and manually crafted rules that determine the intensity of each light stimulus as well as the termination criteria, which is suboptimal. We leverage deep reinforcement learning to find improved decision strategies for visual field testing. In our proposed algorithms, multiple intelligent agents interact with the patient in an extensive-form game fashion, with each agent controlling the test at one of the locations in the patient's visual field. Through training, each agent learns an optimized policy that determines the intensities of light stimuli and the termination criteria, minimizing both the error in sensitivity estimation and the test duration. In simulation experiments, we compare our algorithms against baseline visual field testing algorithms and show that they achieve a better trade-off between estimation accuracy and test duration. By retaining testing accuracy with reduced test duration, our algorithms improve test reliability, clinic efficiency, and patient satisfaction, and may translate into improved clinical outcomes.
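For intuition about what each per-location agent must decide, here is a hand-coded bisection stand-in for the stimulus-selection and termination policy that the paper instead learns with deep RL; the dB convention, presentation cap, and termination tolerance are illustrative assumptions:

```python
def next_action(responses, max_presentations=6):
    """Toy per-location policy (illustrative only). In perimetry dB units,
    a higher dB value means a dimmer stimulus: a stimulus is seen when its
    dB value is at or below the patient's sensitivity.

    responses: list of (intensity_dB, seen) tuples collected so far.
    Returns ('present', intensity_dB) or ('terminate', sensitivity_estimate).
    """
    lo = max((i for i, seen in responses if seen), default=0)       # seen => sensitivity >= i
    hi = min((i for i, seen in responses if not seen), default=40)  # missed => sensitivity < i
    if len(responses) >= max_presentations or hi - lo <= 2:
        return ('terminate', (lo + hi) / 2)   # final sensitivity estimate
    return ('present', (lo + hi) // 2)        # bisect the remaining bracket
```

A learned policy replaces this fixed rule with one optimized jointly for estimation error and test duration.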
The reinforcement learning (RL) theory of the reward positivity (RewP), an event-related potential (ERP) component that measures reward responsivity, suggests that the RewP should be largest when positive outcomes are unexpected, and it has been supported by work using appetitive outcomes (e.g., money). However, the RewP can also be elicited by the absence of aversive outcomes (e.g., shock). The limited work to date that has manipulated expectancy while using aversive outcomes has not supported the predictions of RL theory. Nonetheless, this work has been difficult to reconcile with the appetitive literature because the RewP was not observed as a reward signal in these studies, which used passive tasks that did not involve participant choice. Here, we tested the predictions of the RL theory by manipulating expectancy in an active/choice-based threat-of-shock doors task that was previously found to elicit the RewP as a reward signal. Moreover, we used principal components analysis to isolate the RewP from overlapping ERP components. Eighty participants viewed pairs of doors surrounded by a red or green border; shock delivery was expected (80%) following red-bordered doors and unexpected (20%) following green-bordered doors. The RewP was observed as a reward signal (i.e., no shock > shock) that was not potentiated for unexpected feedback. In addition, the RewP was larger overall for unexpected (vs. expected) feedback. Therefore, the RewP appears to reflect the additive (not interactive) effects of reward and expectancy, challenging the RL theory of the RewP, at least when reward is defined as the absence of an aversive outcome.
Reinforcement learning (RL) theory suggests that the reward positivity (RewP) should be largest when positive outcomes are unexpected, though this has only been shown using appetitive outcomes (e.g., money). Here, we defined reward as the absence of aversive outcomes (i.e., shock), while varying feedback expectancy. Results showed that the RewP was potentiated by the additive (not interactive) effects of absent aversive outcomes and unexpected feedback, challenging RL theory.
Advancements in steel reinforcement bending machines have allowed for the fabrication of continuously wound ties (CWTs). CWTs are being used in place of conventional transverse reinforcement to reduce waste and construction time and to alleviate congestion. A total of 20 reduced-scale special boundary elements (SBEs), using Grade 60 and Grade 80 conventional hoops and CWTs, were tested under uniaxial compression to evaluate the performance of members with CWTs. All specimens exceeded the ACI nominal axial strength at zero eccentricity calculated using measured material properties and ignoring reduction factors. The CWT specimens exhibited improved post-peak ductility compared with conventional hoops when all current ACI requirements for SBE transverse reinforcement were satisfied. Post-peak ductility was further enhanced by using Grade 80 CWTs in conjunction with 10 ksi (69 MPa) concrete. Confined concrete strengths from three well-established models were reasonably close to the measured values; however, all the models overestimated post-peak ductility regardless of the type of transverse reinforcement. Keywords: ACI 318; axial loading; confined concrete; ductility; high-strength reinforcement; hoops; seismic detailing; special boundary elements (SBEs); transverse reinforcement.
This paper investigates the reinforcement learning (RL) adaptive tracking control design problem for a class of mismatched stochastic nonlinear systems with a non-affine structure. The stochastic system studied in this paper is more general owing to the presence of non-affine inputs, internal uncertainties, and mismatched external disturbances. First, to handle the non-affine structure of the stochastic system, an extended stochastic differential equation is constructed. Based on the actor-critic framework, reinforcement signals drive the network to evolve more quickly toward the desired direction while compensating for internal uncertainties and induced uncertainties in the stochastic system. Furthermore, to address approximation errors and external disturbances while preserving stochastic stability, adaptive laws for a disturbance-bound estimator are established using higher powers of the tracking errors. As a result, a novel non-affine adaptive tracking controller is proposed by integrating RL, disturbance-bound estimation, and the dynamic surface method. The stability analysis proves that all closed-loop signals are bounded in probability and that the system output converges to a small neighborhood of the desired trajectory. Numerical simulations demonstrate the effectiveness and superiority of the proposed controller.
• Auxiliary integration yields an extended stochastic differential equation system.
• The actor-critic network reduces fitting error and improves environmental adaptability.
• All closed-loop signals are bounded in probability.
• A deep reinforcement learning based energy management strategy for PHEB is proposed.
• The proposed approach fundamentally avoids discretization error and the curse of dimensionality.
• The robustness of the proposed approach was verified on three typical driving cycles.
• The results show that the proposed control strategy achieves performance similar to DP with less computational load.
Energy management is a fundamental task in the hybrid electric vehicle community. Efficient energy management of hybrid electric vehicles is challenging owing to the enormous search space, multitudinous control variables, and complicated driving conditions. Most existing methods apply discretization to approximate the continuous optimum under real driving conditions, which results in relatively low performance due to discretization error and the curse of dimensionality. We introduce a novel energy management strategy based on a deep reinforcement learning framework, Actor-Critic (AC), to address these challenges. AC uses one deep neural network, the actor network, to directly output continuous control signals, while another deep neural network, the critic network, evaluates the control signals generated by the actor network. The actor and critic networks are trained by reinforcement learning from self-play in a continuous action space. Several comprehensive experiments are conducted in this paper. The proposed method surpasses discretization-based strategies by optimizing directly in the continuous space, which improves energy management performance while reducing computational load. The simulation results indicate that AC achieves the optimal energy distribution compared with discretization-based strategies, surpassing the existing DP baseline by 5.5%, 2.9%, and 9.5% on the CTUDC, WVUCITY, and WVUSUB driving cycles, respectively, at one-tenth of the computational cost.
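The described setup, where the actor emits continuous control signals and the critic scores them, resembles a deterministic-policy-gradient update. A minimal PyTorch sketch follows; the network sizes, the state and action definitions, and the omission of target networks and a replay buffer are simplifying assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 1  # e.g., [SoC, speed, accel, power demand] -> power split

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())   # continuous action
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                      # Q(s, a)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, gamma=0.99):
    # Critic: regress Q(s, a) toward the bootstrapped target
    # (target networks omitted for brevity).
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, actor(s_next)], dim=-1))
        target = r + gamma * q_next
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own action.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```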
Learning locomotion skills is a challenging problem. To generate realistic and smooth locomotion, existing methods use motion capture, finite state machines, or morphology-specific knowledge to guide the motion generation algorithms. Deep reinforcement learning (DRL) is a promising approach for the automatic creation of locomotion control. Indeed, a standard benchmark for DRL is to automatically create a running controller for a biped character from a simple reward function [Duan et al. 2016]. Although several different DRL algorithms can successfully create a running controller, the resulting motions usually look nothing like a real runner. This paper takes a minimalist learning approach to the locomotion problem, without the use of motion examples, finite state machines, or morphology-specific knowledge. We introduce two modifications to the DRL approach that, when used together, produce locomotion behaviors that are symmetric, low-energy, and much closer to those of a real person. First, we introduce a new term in the loss function (not the reward function) that encourages symmetric actions. Second, we introduce a new curriculum learning method that provides modulated physical assistance to help the character with left/right balance and forward movement. The algorithm automatically computes appropriate assistance for the character and gradually relaxes this assistance, so that eventually the character learns to move entirely without help. Because our method does not make use of motion capture data, it can be applied to a variety of character morphologies. We demonstrate locomotion controllers for the lower half of a biped, a full humanoid, a quadruped, and a hexapod. Our results show that the learned policies produce symmetric, low-energy gaits. In addition, speed-appropriate gait patterns emerge without any guidance from motion examples or contact planning.
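The first modification, a symmetry term placed in the loss function rather than the reward, can be sketched directly. In the code below, the mirror operators that swap left/right state and action components are morphology-specific and assumed for illustration, not taken from the paper:

```python
import torch

def symmetry_loss(policy, states, mirror_state, mirror_action):
    """Mirror-symmetry term added to the policy *loss* (not the reward):
    penalize the policy for acting differently on mirrored states.

    mirror_state / mirror_action: callables that swap the left/right
    components of a batch of states or actions (assumed, morphology-specific).
    """
    actions = policy(states)
    mirrored = mirror_action(policy(mirror_state(states)))
    return torch.mean((actions - mirrored) ** 2)

# Sketch of the combined objective: a standard surrogate policy loss
# (e.g., PPO's) plus the weighted symmetry term.
# total_loss = surrogate_loss + w_sym * symmetry_loss(policy, states,
#                                                     mirror_state, mirror_action)
```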
To explore the differences between single-reinforced and hybrid-reinforced Al–Mg–Si alloys with TiC and B4C, 10 wt.%-(TiC + B4C)/6061Al, 10 wt.%-B4C/6061Al, 10 wt.%-TiC/6061Al, and 6061Al were prepared via wet mixing for 8 h followed by vacuum hot-press sintering at 580 °C and 30 MPa for 2 h. Microstructure evolution, mechanical properties, and wear behavior were investigated. The results show that, unlike single reinforcement with TiC, single reinforcement with B4C and hybrid reinforcement with TiC and B4C altered the distribution of Si and facilitated the in-situ formation of SiC. Both single and hybrid reinforcement resulted in mixed fracture characteristics of ductile fracture and cleavage fracture, while hybrid reinforcement with TiC and B4C demonstrated a superior combination of strength and plasticity. At equivalent particle mass fractions, single reinforcement with TiC was the most conducive to inducing recrystallization during sintering, followed by single reinforcement with B4C, with hybrid reinforcement of TiC and B4C having the least effect. At equivalent particle mass fractions, single reinforcement with B4C had the most significant effect on improving wear resistance, followed by hybrid reinforcement with TiC and B4C, with single reinforcement of TiC having the least effect. All particle addition methods exhibited a mixed wear mechanism involving abrasive, adhesive, and spalling wear. This study provides valuable insights into the development of aluminum matrix composites.
Humans and other animals often infer spurious associations among unrelated events. However, such superstitious learning is usually accounted for by conditioned associations, raising the question of whether an animal could develop more complex cognitive structures independent of reinforcement. Here, we tasked monkeys with discovering the serial order of two pictorial sets: a "learnable" set in which the stimuli were implicitly ordered and monkeys were rewarded for choosing the higher-rank stimulus, and an "unlearnable" set in which stimuli were unordered and feedback was random regardless of the choice. We replicated prior results showing that monkeys reliably learn the implicit order of the learnable set. Surprisingly, the monkeys behaved as though some ordering also existed in the unlearnable set, showing consistent choice preferences that transferred to novel untrained pairs in this set, even under a preference-discouraging reward schedule that gave rewards more frequently to the stimulus that was selected less often. In simulations, a model-free reinforcement learning algorithm (Q-learning) displayed a degree of consistent ordering among the unlearnable set but, unlike the monkeys, failed to do so under the preference-discouraging reward schedule. Our results suggest that monkeys infer abstract structures from objectively random events using heuristics that extend beyond stimulus-outcome conditional learning to more cognitive model-based learning mechanisms.
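To make the model-free benchmark concrete, here is a toy Q-learning simulation of the unlearnable set under the standard random-feedback schedule; the parameters and pairing scheme are illustrative assumptions, not the paper's. Random reinforcement still leaves an arbitrary but self-consistent value ranking, which reads out as a stable choice preference:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, alpha, n_trials = 7, 0.1, 5000
q = np.zeros(n_stimuli)            # one learned value per "unlearnable" stimulus
choice_counts = np.zeros(n_stimuli)

for _ in range(n_trials):
    a, b = rng.choice(n_stimuli, size=2, replace=False)  # random stimulus pair
    # Softmax choice between the two presented stimuli.
    p_a = 1.0 / (1.0 + np.exp(-(q[a] - q[b])))
    chosen = a if rng.random() < p_a else b
    choice_counts[chosen] += 1
    reward = float(rng.random() < 0.5)   # feedback is random: no true order exists
    q[chosen] += alpha * (reward - q[chosen])

# The value ranking and the choice ranking align even though feedback was random.
print("value ranking:  ", np.argsort(-q))
print("choice ranking: ", np.argsort(-choice_counts))
```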