In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro’s TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10 × 10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.
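Since the abstract defines the SiLU as the input multiplied by its sigmoid, both activations can be written in a few lines; the following NumPy sketch only illustrates those definitions and is not code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU: the input multiplied by its sigmoid, x * sigmoid(x)
    return x * sigmoid(x)

def dsilu(x):
    # dSiLU: the derivative of the SiLU with respect to its input,
    # sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))
```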
• The duality of sensory inference and optimal control has been known since the 1960s.
• The duality stems from the common computations for the posterior distribution in dynamic Bayesian inference and the value function in reinforcement learning.
• We consider the hypothesis that the sensory and motor cortical circuits implement dual computations of sensory inference and optimal control, or perceptual and value-based decision making, respectively.
The duality of sensory inference and motor control has been known since the 1960s and has recently been recognized as the commonality in computations required for the posterior distributions in Bayesian inference and the value functions in optimal control. Meanwhile, an intriguing question about the brain is why the entire neocortex shares a canonical six-layer architecture while its posterior and anterior halves are engaged in sensory processing and motor control, respectively.
Here we consider the hypothesis that the sensory and motor cortical circuits implement the dual computations for Bayesian inference and optimal control, or perceptual and value-based decision making, respectively. We first review the classic duality of inference and control in linear quadratic systems and then review the correspondence between dynamic Bayesian inference and optimal control. Based on the architecture of the canonical cortical circuit, we explore how different cortical neurons may represent variables and implement computations.
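As a pointer to the classic result referred to here, the linear-quadratic duality can be summarized by placing the two Riccati recursions side by side. The notation below (A, B, C system matrices; Q, R cost weights; W, V noise covariances) is ours, and this is textbook material rather than anything specific to the paper:

```latex
% Discrete-time LQ regulator (control) Riccati recursion, backward in time:
S_{t-1} = Q + A^{\top} S_t A
          - A^{\top} S_t B \left(R + B^{\top} S_t B\right)^{-1} B^{\top} S_t A
% Kalman filter (estimation) Riccati recursion, forward in time:
P_{t+1} = W + A P_t A^{\top}
          - A P_t C^{\top} \left(V + C P_t C^{\top}\right)^{-1} C P_t A^{\top}
% Duality: A <-> A^T, B <-> C^T, Q <-> W, R <-> V, with time reversed.
```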
This paper proposes a model-free imitation learning method named Entropy-Regularized Imitation Learning (ERIL) that minimizes the reverse Kullback–Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. An inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators. The first discriminator distinguishes the state generated by the forward RL step from the expert’s state. The second discriminator, which is structured by the theory of entropy regularization, distinguishes the state–action–next-state tuples generated by the learner from the expert ones. One notable feature is that the second discriminator shares hyperparameters with the forward RL, which can be used to control the discriminator’s ability. A forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy. Our experimental results on MuJoCo-simulated environments and vision-based reaching tasks with a robotic arm show that ERIL is more sample-efficient than the baseline methods. We apply the method to the behaviors of human subjects performing a pole-balancing task and describe how the estimated reward functions reveal how each subject achieves her goal.
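For readers unfamiliar with discriminator-based density-ratio estimation, the generic trick that such an inverse RL step relies on can be sketched as follows. The callables `D_state` and `D_tuple` are hypothetical placeholders for the two binary discriminators, and the exact way ERIL combines their outputs may differ from this sketch:

```python
import numpy as np

def log_density_ratios(D_state, D_tuple, s, a, s_next):
    """Sketch of the standard discriminator-to-log-ratio trick.

    D_state(s)            -> probability that state s came from the expert
    D_tuple(s, a, s_next) -> probability that the (s, a, s') tuple came from the expert
    A binary discriminator D trained to separate expert (label 1) from learner
    (label 0) samples estimates log(p_expert / p_learner) as log D - log(1 - D).
    """
    r_state = np.log(D_state(s)) - np.log(1.0 - D_state(s))
    r_tuple = np.log(D_tuple(s, a, s_next)) - np.log(1.0 - D_tuple(s, a, s_next))
    return r_state, r_tuple
```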
• Developed model-free imitation learning by forward and inverse reinforcement learning.
• Forward and inverse reinforcement learning share neural network weights and hyperparameters.
• Derived a structured discriminator with hyperparameters using entropy regularization.
• Both forward and inverse reinforcement learning update the state value function.
The striatum is a major input site of the basal ganglia, which play an essential role in decision making. Previous studies have suggested that subareas of the striatum have distinct roles: the dorsolateral striatum (DLS) functions in habitual actions, the dorsomedial striatum (DMS) in goal-directed actions, and the ventral striatum (VS) in motivation. To elucidate the distinctive functions of subregions of the striatum in decision making, we systematically investigated the information represented by phasically active neurons in the DLS, DMS, and VS. Rats performed two types of choice tasks: fixed- and free-choice tasks. In both tasks, rats were required to perform nose poking to either the left or right hole after cue-tone presentation. A food pellet was delivered probabilistically depending on the presented cue and the selected action. The reward probability was fixed in the fixed-choice task and varied in a block-wise manner in the free-choice task. We found the following: (1) when rats began the tasks, a majority of VS neurons increased their firing rates, and information regarding task type and state value was most strongly represented in the VS; (2) during action selection, information about actions and action values was most strongly represented in the DMS; (3) action-command information (action representation before action selection) was stronger in the fixed-choice task than in the free-choice task in both the DLS and DMS; and (4) action-command information was strongest in the DLS, particularly when the same choice was repeated. We propose a hypothesis of hierarchical reinforcement learning in the basal ganglia to coherently explain these results.
A major advance in understanding learning behavior stems from experiments showing that reward learning requires dopamine inputs to striatal neurons and arises from synaptic plasticity of cortico-striatal synapses. Numerous reinforcement learning models mimic this dopamine-dependent synaptic plasticity by using the reward prediction error, which resembles dopamine neuron firing, to learn the best action in response to a set of cues. Though these models can explain many facets of behavior, reproducing some types of goal-directed behavior, such as renewal and reversal, requires additional model components. Here we present a reinforcement learning model, TD2Q, which better corresponds to the basal ganglia, with two Q matrices, one representing direct-pathway neurons (G) and another representing indirect-pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is that the G and N matrices are updated using the temporal difference reward prediction error. A best action is selected for N and G using a softmax with a reward-dependent adaptive exploration parameter, and differences are then resolved using a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks, including extinction, renewal, and discrimination; switching reward probability learning; and sequence learning. Simulations show that TD2Q produces behaviors similar to rodents in choice and sequence-learning tasks, and that use of the temporal difference reward prediction error is required to learn multi-step tasks. Blocking the update rule on the N matrix blocks discrimination learning, as observed experimentally. Performance in the sequence-learning task is dramatically improved with two matrices. These results suggest that including additional aspects of basal ganglia physiology can improve the performance of reinforcement learning models, better reproduce animal behaviors, and provide insight into the roles of direct- and indirect-pathway striatal neurons.
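A rough sketch of a two-Q agent of this kind is given below. The sign of the indirect-pathway (N) update, the use of a max-based TD target, and the way the two softmax probabilities are combined in the second selection step are our own placeholder assumptions, not the specific TD2Q rules:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    p = np.exp(beta * (q - q.max()))
    return p / p.sum()

def td2q_style_update(G, N, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Both matrices are updated from the same TD reward prediction error."""
    delta = r + gamma * G[s_next].max() - G[s, a]  # TD error from the G (direct) matrix
    G[s, a] += alpha * delta                       # direct pathway strengthened by positive RPE
    N[s, a] -= alpha * delta                       # opposite sign for indirect pathway (assumption)
    return delta

def two_stage_select(G, N, s, beta=2.0):
    """Softmax over each matrix, then a second resolution step over the two
    action-probability vectors (the combination rule here is illustrative)."""
    p_g = softmax(G[s], beta)
    p_n = softmax(-N[s], beta)   # lower N value -> less suppression (assumption)
    p = p_g * p_n
    p /= p.sum()
    return rng.choice(len(p), p=p)
```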
Reinforcement learning theory plays a key role in understanding the behavioral and neural mechanisms of choice behavior in animals and humans. In particular, intermediate variables of learning models estimated from behavioral data, such as the expectation of reward for each candidate choice (action value), have been used in searches for the neural correlates of computational elements in learning and decision making. The aims of the present study are as follows: (1) to test which computational model best captures the choice learning process in animals and (2) to elucidate how action values are represented in different parts of the corticobasal ganglia circuit. We compared different behavioral learning algorithms to predict the choice sequences generated by rats during a free-choice task and analyzed the associated neural activity in the nucleus accumbens (NAc) and ventral pallidum (VP). The major findings of this study were as follows: (1) modified versions of an action-value learning model captured a variety of choice strategies of rats, including win-stay-lose-switch and persevering behavior, and predicted rats' choice sequences better than the best multistep Markov model; and (2) information about action values and future actions was coded in both the NAc and VP, but was less dominant than information about trial types, selected actions, and reward outcomes. The results of our model-based analysis suggest that the primary role of the NAc and VP is to monitor information important for updating choice behaviors. Information represented in the NAc and VP might contribute to a choice mechanism that is situated elsewhere.
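As an illustration of the kind of action-value learning model used in such model-based analyses, a textbook two-choice Q-learning rule with softmax action selection might look like this (the specific model variants compared in the paper include further modifications):

```python
import numpy as np

def simulate_action_value_learner(rewards, alpha=0.2, beta=3.0, seed=0):
    """Generic action-value model of a two-choice task, for illustration only.

    rewards: array of shape (n_trials, 2) giving the reward each option would
             deliver on each trial. Returns the sequence of simulated choices.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                      # action values for left / right
    choices = []
    for r_t in rewards:
        # Softmax choice probability for a two-action task reduces to a logistic.
        p_right = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        a = int(rng.random() < p_right)
        q[a] += alpha * (r_t[a] - q[a])  # update only the chosen action's value
        choices.append(a)
    return np.array(choices)
```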
Recent experiments have shown that optogenetic activation of serotonin neurons in the dorsal raphe nucleus (DRN) in mice enhances patience in waiting for future rewards. Here, we show that the serotonin effect in promoting waiting is maximized by both high probability and high timing uncertainty of reward. Optogenetic activation of serotonergic neurons prolongs waiting time in no-reward trials in a task with a 75% food reward probability, but not with 50% or 25% reward probabilities. The serotonin effect in promoting waiting increases when the timing of reward presentation becomes unpredictable. To coherently explain the experimental data, we propose a Bayesian decision model of waiting that assumes that serotonin neuron activation increases the prior probability or subjective confidence of reward delivery. The present data and modeling point to the possibility of a generalized role of serotonin in resolving trade-offs, not only between immediate and delayed rewards, but also between sensory evidence and subjective confidence.
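One simple way to formalize this idea is a survival-style posterior over whether a reward is still coming, with serotonin activation modeled as raising the prior. The Gaussian reward-timing distribution and the threshold rule below are our own illustrative assumptions, not the paper's exact model:

```python
import numpy as np
from scipy import stats

def posterior_reward_still_coming(t, p_prior, timing_mean=6.0, timing_sd=2.0):
    """Posterior probability that a reward will still arrive, given that none
    has arrived by elapsed waiting time t (illustrative sketch).

    p_prior : prior probability that the trial is rewarded
              (hypothetically raised by serotonin neuron activation).
    timing_*: assumed Gaussian distribution of reward delivery times.
    """
    survival = 1.0 - stats.norm.cdf(t, loc=timing_mean, scale=timing_sd)
    return p_prior * survival / (p_prior * survival + (1.0 - p_prior))

# A simple waiting rule: keep waiting while the posterior exceeds a give-up threshold,
# e.g. wait while posterior_reward_still_coming(t, p_prior) > 0.3
```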
Behaving efficiently and flexibly is crucial for biological and artificial embodied agents. Behavior is generally classified into two types: habitual (fast but inflexible) and goal-directed (flexible but slow). While these two types of behaviors are typically considered to be managed by two distinct systems in the brain, recent studies have revealed a more sophisticated interplay between them. We introduce a theoretical framework using variational Bayesian theory, incorporating a Bayesian intention variable. Habitual behavior depends on the prior distribution of intention, computed from sensory context without goal specification. In contrast, goal-directed behavior relies on the goal-conditioned posterior distribution of intention, inferred through variational free energy minimization. Assuming that an agent behaves using a synergized intention, our simulations in vision-based sensorimotor tasks explain the key properties of their interaction as observed in experiments. Our work suggests a fresh perspective on the neural mechanisms of habits and goals, shedding light on future research in decision making.
Dynamic Bayesian inference allows a system to infer the environmental state under conditions of limited sensory observation. Using a goal-reaching task, we found that posterior parietal cortex (PPC) and adjacent posteromedial cortex (PM) implemented the two fundamental features of dynamic Bayesian inference: prediction of hidden states using an internal state transition model and updating the prediction with new sensory evidence. We optically imaged the activity of neurons in mouse PPC and PM layers 2, 3 and 5 in an acoustic virtual-reality system. As mice approached a reward site, anticipatory licking increased even when sound cues were intermittently presented; this was disturbed by PPC silencing. Probabilistic population decoding revealed that neurons in PPC and PM represented goal distances during sound omission (prediction), particularly in PPC layers 3 and 5, and prediction improved with the observation of cue sounds (updating). Our results illustrate how cerebral cortex realizes mental simulation using an action-dependent dynamic model.
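The two features identified here, prediction through an internal transition model and updating with new evidence, are the two steps of a generic Bayesian filter; a minimal discrete-state sketch (not the decoding model used in the study) is:

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood=None):
    """One step of dynamic Bayesian inference over discrete hidden states
    (e.g. discretized goal distances).

    belief     : current posterior over states, shape (n_states,)
    transition : transition[s, s'] = P(s' | s) under the current action
    likelihood : P(observation | s'), shape (n_states,), or None when the cue
                 is omitted on this step (prediction only).
    """
    # Prediction: propagate the belief through the internal transition model.
    predicted = transition.T @ belief
    if likelihood is None:
        return predicted / predicted.sum()
    # Update: weight the prediction by the evidence from the observed cue.
    posterior = likelihood * predicted
    return posterior / posterior.sum()
```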
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework.
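For reference, the continuous-time form of the TD error that the value function estimation minimizes can be written as follows (V: value estimate, r: reward, τ: time constant of reward discounting):

```latex
% Continuous-time TD error; the value estimate is self-consistent
% when \delta(t) = 0 for all t.
\delta(t) = r(t) - \frac{1}{\tau} V(t) + \frac{dV(t)}{dt}
```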
The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.