The problem of off-policy evaluation (OPE) has long been regarded as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms make up the major part of off-policy TD learning methods. In this work, we investigate the derivation of efficient OPE algorithms from a novel perspective that builds on the advantages of these two categories: the gradient-based framework is adopted, and the emphatic approach is used to improve convergence performance. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, we propose the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm. Under standard off-policy and stochastic approximation conditions, we provide a convergence analysis of ETD-LVC in the case of linear function approximation, and we further generalize the algorithm to nonlinear smooth function approximation. Finally, we empirically demonstrate the improved performance of ETD-LVC on off-policy benchmarks. Taken together, we hope that our work can guide the future discovery of better alternatives within the off-policy TD learning algorithm family.
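To make the ingredients named in the abstract above concrete, the sketch below shows one way an emphatic (followon-trace) weighting can be combined with a TDC-style gradient correction under linear function approximation. It is only an illustrative sketch with assumed names, step sizes, and trace recursion; the paper's exact DC-MSPBE gradient and ETD-LVC update rules may differ.

```python
# Illustrative sketch: emphatic weighting combined with a TDC-style gradient
# correction under linear function approximation.  Variable names, step sizes,
# and the placement of the emphasis are assumptions, not the paper's ETD-LVC.
import numpy as np

def emphatic_gradient_correction_step(theta, w, F, phi, phi_next, reward, rho,
                                      gamma=0.99, interest=1.0,
                                      alpha=0.01, beta=0.05):
    """One off-policy TD step with emphasis and a gradient-correction term.

    theta    -- primary weights, value estimate v(s) ~= theta @ phi(s)
    w        -- secondary weights used by the correction term
    F        -- followon (emphasis) trace carried between steps
    rho      -- importance-sampling ratio pi(a|s) / b(a|s)
    """
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    F = gamma * rho * F + interest     # simplified followon-trace recursion
    M = F                              # emphasis (lambda = 0 case)
    # TDC-style primary update, scaled by the emphatic weighting M:
    theta = theta + alpha * M * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    # Secondary (faster-timescale) update for the correction weights:
    w = w + beta * M * rho * (delta - (w @ phi)) * phi
    return theta, w, F
```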
Rethinking dopamine as generalized prediction error Gardner, Matthew P H; Schoenbaum, Geoffrey; Gershman, Samuel J
Proceedings of the Royal Society B: Biological Sciences, 11/2018, Volume 285, Issue 1891
Journal Article; Peer reviewed; Open access
Midbrain dopamine neurons are commonly thought to report a reward prediction error (RPE), as hypothesized by reinforcement learning (RL) theory. While this theory has been highly successful, several lines of evidence suggest that dopamine activity also encodes sensory prediction errors unrelated to reward. Here, we develop a new theory of dopamine function that embraces a broader conceptualization of prediction errors. By signalling errors in both sensory and reward predictions, dopamine supports a form of RL that lies between model-based and model-free algorithms. This account remains consistent with current canon regarding the correspondence between dopamine transients and RPEs, while also accounting for new data suggesting a role for these signals in phenomena such as sensory preconditioning and identity unblocking, which ostensibly draw upon knowledge beyond reward predictions.
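(For reference: in the standard TD account invoked here and in the records that follow, the RPE is the TD error δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t), the gap between the reward plus discounted next-state value that is actually observed and the value that was predicted.)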
This paper is concerned with the problem of policy evaluation with linear function approximation in discounted infinite-horizon Markov decision processes. We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely used policy evaluation algorithms: the temporal difference (TD) learning algorithm and the two-timescale linear TD with gradient correction (TDC) algorithm. In both the on-policy setting, where observations are generated from the target policy, and the off-policy setting, where samples are drawn from a behavior policy potentially different from the target policy, we establish the first sample complexity bounds with high-probability convergence guarantees that attain the optimal dependence on the tolerance level. We also exhibit an explicit dependence on problem-related quantities, and show in the on-policy setting that our upper bound matches the minimax lower bound on crucial problem parameters, including the choice of the feature map and the problem dimension.
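As a reminder of the two updates being compared, the following is a minimal sketch of linear TD(0) and two-timescale TDC; the step sizes and the handling of the importance ratio are illustrative assumptions rather than the paper's exact conditions.

```python
# Minimal sketches of linear TD(0) and TDC (TD with gradient correction).
import numpy as np

def td0_step(theta, phi, phi_next, reward, gamma=0.99, rho=1.0, alpha=0.01):
    """One step of (off-policy) linear TD(0); rho = 1 recovers on-policy TD."""
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    return theta + alpha * rho * delta * phi

def tdc_step(theta, w, phi, phi_next, reward, gamma=0.99, rho=1.0,
             alpha=0.01, beta=0.05):
    """One step of two-timescale TDC.

    The secondary vector w supplies the correction term and is updated on the
    faster timescale (beta larger than alpha).
    """
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - (w @ phi)) * phi
    return theta, w
```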
To behave adaptively, we must learn from the consequences of our actions. Studies using event-related potentials (ERPs) have been informative with respect to the question of how such learning occurs. These studies have revealed a frontocentral negativity termed the feedback-related negativity (FRN) that appears after negative feedback. According to one prominent theory, the FRN tracks the difference between the values of actual and expected outcomes, or reward prediction errors. As such, the FRN provides a tool for studying reward valuation and decision making. We begin this review by examining the neural significance of the FRN. We then examine its functional significance. To understand the cognitive processes that occur when the FRN is generated, we explore variables that influence its appearance and amplitude. Specifically, we evaluate four hypotheses: (1) the FRN encodes a quantitative reward prediction error; (2) the FRN is evoked by outcomes and by stimuli that predict outcomes; (3) the FRN and behavior change with experience; and (4) the system that produces the FRN is maximally engaged by volitional actions.
The operation of a community energy storage system (CESS) is challenging due to the volatility of photovoltaic distributed generation, electricity consumption, and energy prices. Selecting the optimal CESS setpoints during the day is a sequential decision problem under uncertainty, which can be solved using dynamic learning methods. This paper proposes a reinforcement learning (RL) technique based on temporal difference learning with eligibility traces (ET). It aims to minimize the day-ahead energy costs while maintaining the technical limits at the grid coupling point. The performance of the RL agent is compared against an oracle based on a deterministic mixed-integer second-order cone program (MISOCP). The use of ET boosts the RL agent's learning rate on the CESS operation problem: the traces effectively assign credit to the action sequences that bring the CESS to a high state of charge before the peak prices, reducing the training time. The case study shows that the proposed method learns to operate the CESS effectively and ten times faster than common RL algorithms applied to energy systems, such as tabular Q-learning and Fitted-Q. Moreover, the RL agent operates the CESS at 94% of the optimal performance, reducing the energy costs for the end-user by up to 12%.
•Reinforcement learning for energy storage operation to reduce energy costs.
•The operation satisfies the electrical distribution grid’s technical constraints.
•The technique uses a linear function approximator with eligibility traces.
•Discussion of advantages of using eligibility traces in energy storage operations.
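For context, the core mechanism credited in the abstract above, eligibility traces with a linear function approximator, can be sketched as a single semi-gradient TD(λ) update. The CESS state encoding and the control rule actually used in the paper are not specified here, so the names and hyperparameters below are illustrative.

```python
# Minimal sketch of one semi-gradient TD(lambda) update with an accumulating
# eligibility trace and linear features; names and hyperparameters are assumed.
import numpy as np

def td_lambda_step(theta, z, phi, phi_next, reward,
                   gamma=0.99, lam=0.9, alpha=0.01):
    """theta: value weights, z: eligibility trace, phi: current state features."""
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    z = gamma * lam * z + phi          # decay the trace, then add current features
    theta = theta + alpha * delta * z  # credit every recently visited feature
    return theta, z
```

Because the trace z still carries the features of earlier charging decisions when a peak-price TD error arrives, those earlier decisions receive credit immediately, which is the faster credit assignment described in the abstract.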
Animals make predictions based on currently available information. In natural settings, sensory cues may not reveal complete information, requiring the animal to infer the “hidden state” of the environment. The brain structures important in hidden state inference remain unknown. A previous study showed that midbrain dopamine neurons exhibit distinct response patterns depending on whether reward is delivered in 100% (task 1) or 90% of trials (task 2) in a classical conditioning task. Here we found that inactivation of the medial prefrontal cortex (mPFC) affected dopaminergic signaling in task 2, in which the hidden state must be inferred (“will reward come or not?”), but not in task 1, where the state was known with certainty. Computational modeling suggests that the effects of inactivation are best explained by a circuit in which the mPFC conveys inference over hidden states to the dopamine system.
•Dopamine reward prediction errors (RPEs) reflect hidden state inference
•Medial prefrontal cortex (mPFC) shapes RPEs in a task involving hidden states
•mPFC is not needed to compute RPEs in a similar task when states are fully observable
•Modeling suggests that mPFC computes a probability distribution over hidden states
Dopamine neurons signal reward prediction errors, driving reinforcement learning. In ambiguous settings, dopamine signals incorporate hidden state inference. We demonstrate that the medial prefrontal cortex is required for hidden state inference to influence dopamine signals, illuminating the neural circuit governing reinforcement learning under state uncertainty.
Central to the organization of behavior is the ability to predict the values of outcomes to guide choices. The accuracy of such predictions is honed by a teaching signal that indicates how incorrect a prediction was (“reward prediction error,” RPE). In several reinforcement learning contexts, such as Pavlovian conditioning and decisions guided by reward history, this RPE signal is provided by midbrain dopamine neurons. In many situations, however, the stimuli predictive of outcomes are perceptually ambiguous. Perceptual uncertainty is known to influence choices, but it has been unclear whether or how dopamine neurons factor it into their teaching signal. To cope with uncertainty, we extended a reinforcement learning model with a belief state about the perceptually ambiguous stimulus; this model generates an estimate of the probability of choice correctness, termed decision confidence. We show that dopamine responses in monkeys performing a perceptually ambiguous decision task comply with the model’s predictions. Consequently, dopamine responses did not simply reflect a stimulus’ average expected reward value but were predictive of the trial-to-trial fluctuations in perceptual accuracy. These confidence-dependent dopamine responses emerged prior to monkeys’ choice initiation, raising the possibility that dopamine impacts impending decisions, in addition to encoding a post-decision teaching signal. Finally, by manipulating reward size, we found that dopamine neurons reflect both the upcoming reward size and the confidence in achieving it. Together, our results show that dopamine responses convey teaching signals that are also appropriate for perceptual decisions.
•Reinforcement learning model with belief state to cope with perceptual uncertainty
•Model provides unified account of dopamine in perceptual and reward-guided choices
•Dopamine can act as a teaching signal during perceptual decision making as well
•Dopamine signals decision confidence prior to behavioral manifestation of choice
Lak et al. show that dopamine neuron responses during a visual decision task comply with predictions of a reinforcement learning model with a belief state signaling confidence. The results reveal that dopamine neurons encode teaching signals appropriate for learning perceptual decisions and respond early enough to impact impending decisions.
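The belief-state idea in the Lak et al. record can be sketched in a few lines: a noisy percept of the ambiguous stimulus yields a posterior belief, the belief gives a decision confidence, and the prediction error is taken against the confidence-weighted expected reward. The Gaussian noise model, the equal-prior assumption, and all names below are illustrative assumptions, not the authors' fitted model.

```python
# Illustrative belief-state trial: percept -> belief -> confidence -> RPE.
import numpy as np

def confidence_rpe(true_category, mu, sigma, reward, reward_size=1.0,
                   rng=np.random):
    """One trial of a belief-state choice with a confidence-weighted RPE.

    true_category -- +1 or -1, hidden identity of the ambiguous stimulus
    mu            -- stimulus strength (distance of each category from zero)
    sigma         -- perceptual noise standard deviation
    reward        -- reward actually delivered after the choice (e.g. 0 or 1)
    """
    percept = true_category * mu + sigma * rng.randn()       # noisy observation
    # Posterior that the stimulus belongs to the +1 category, assuming equal
    # priors and Gaussian perceptual noise:
    belief_pos = 1.0 / (1.0 + np.exp(-2.0 * mu * percept / sigma**2))
    choice = 1 if belief_pos >= 0.5 else -1
    confidence = max(belief_pos, 1.0 - belief_pos)           # P(choice is correct)
    expected_value = confidence * reward_size                # pre-outcome prediction
    rpe = reward - expected_value                            # outcome prediction error
    return choice, confidence, expected_value, rpe
```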
Natural actor–critic algorithms Bhatnagar, Shalabh; Sutton, Richard S.; Ghavamzadeh, Mohammad ...
Automatica (Oxford), 11/2009, Volume 45, Issue 11
Journal Article; Peer reviewed; Open access
We present four new reinforcement learning algorithms based on actor–critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor–critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor–critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor–critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
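The kind of incremental actor-critic step the abstract describes can be sketched with a linear TD(0) critic and a softmax (Gibbs) actor. The paper's four algorithms additionally differ in how the actor step is preconditioned (for example, by a natural-gradient estimate built from the compatible features); that machinery is omitted here, and all names below are illustrative.

```python
# Minimal sketch of one incremental actor-critic transition (vanilla gradient
# actor; the natural-gradient preconditioning of the paper is not shown).
import numpy as np

def softmax_policy(theta_actor, action_features):
    """Action probabilities of a Gibbs (softmax) policy over linear preferences."""
    prefs = np.array([theta_actor @ f for f in action_features])
    prefs -= prefs.max()                      # numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def actor_critic_step(v, theta_actor, phi, phi_next, reward, action_features,
                      action, gamma=0.99, alpha_critic=0.05, alpha_actor=0.01):
    """One transition: TD(0) critic update, then a policy-gradient actor update."""
    delta = reward + gamma * (v @ phi_next) - (v @ phi)   # TD error from the critic
    v = v + alpha_critic * delta * phi                    # critic update
    probs = softmax_policy(theta_actor, action_features)
    # grad log pi(a|s) for the softmax-linear ("compatible") parameterization:
    grad_log_pi = action_features[action] - probs @ np.array(action_features)
    theta_actor = theta_actor + alpha_actor * delta * grad_log_pi  # actor update
    return v, theta_actor
```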
This paper presents a computationally efficient smart home energy management system (SHEMS) using an approximate dynamic programming (ADP) approach with temporal difference learning for scheduling distributed energy resources. This approach improves the performance of an SHEMS by incorporating stochastic energy consumption and PV generation models over a horizon of several days, using only the computational power of existing smart meters. In this paper, we consider a PV-storage (thermal and battery) system; however, our method extends to multiple controllable devices without the exponential growth in computation that other methods, such as dynamic programming (DP) and stochastic mixed-integer linear programming (MILP), suffer from. Specifically, probability distributions associated with the PV output and demand are kernel-estimated from empirical data collected during the Smart Grid Smart City project in NSW, Australia. Our results show that ADP computes a solution much faster than both DP and stochastic MILP, with only a slight reduction in quality compared to the optimal DP solution. In addition, incorporating a thermal energy storage unit using the proposed ADP-based SHEMS reduces the daily electricity cost by up to 26.3% without a noticeable increase in the computational burden. Moreover, ADP with a two-day decision horizon reduces the average yearly electricity cost by 4.6% relative to a daily DP method, yet requires less than half of the computational effort.
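The general ADP-with-TD pattern referred to above can be sketched as follows: at each decision stage, choose the action that minimizes immediate cost plus an approximate cost-to-go, then nudge the value approximation toward the sampled one-step target. The state, cost, and transition interfaces and all names below are illustrative assumptions, not the paper's SHEMS formulation.

```python
# Illustrative ADP decision stage with a linear value approximation and a
# temporal-difference update toward the sampled one-step Bellman target.
import numpy as np

def adp_step(theta, featurize, state, candidate_actions, transition, cost,
             gamma=1.0, alpha=0.05):
    """One decision stage of ADP.

    transition(state, a) -> next_state   (sampled/simulated system model)
    cost(state, a)       -> immediate electricity cost of action a
    featurize(state)     -> feature vector for the linear approximation
    """
    next_states = [transition(state, a) for a in candidate_actions]
    # Greedy action: immediate cost plus approximate cost-to-go of the result.
    q = [cost(state, a) + gamma * (theta @ featurize(s_next))
         for a, s_next in zip(candidate_actions, next_states)]
    best = int(np.argmin(q))
    action, next_state = candidate_actions[best], next_states[best]
    # Temporal-difference update of the value weights toward the sampled target.
    phi = featurize(state)
    theta = theta + alpha * (q[best] - theta @ phi) * phi
    return theta, action, next_state
```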