  • Actor-Critic Learning Contr...
    Li, Luntong; Li, Dazi; Song, Tianheng; Xu, Xin

    IEEE Transactions on Neural Networks and Learning Systems, December 2018, Volume: 29, Issue: 12
    Journal Article

    Actor-critic methods based on the policy gradient (PG-based AC) have been widely studied to solve learning control problems. To increase the data efficiency of learning prediction in the critic of PG-based AC, recent studies have investigated how recursive least-squares temporal difference (RLS-TD) algorithms can be used for policy evaluation. In such settings, however, the RLS-TD critic evaluates an unknown mixed policy generated by a series of different actors, not the single fixed policy generated by the current actor. Therefore, this AC framework with an RLS-TD critic cannot be proved to converge to the optimal fixed point of the learning problem. To address this problem, this paper proposes a new AC framework named critic-iteration PG (CIPG), which learns the state-value function of the current policy in an on-policy way and performs gradient ascent in the direction of improving the discounted total reward. During each iteration, CIPG keeps the policy parameters fixed and evaluates the resulting fixed policy with an ℓ2-regularized RLS-TD critic. Our convergence analysis extends previous convergence analysis of PG with function approximation to the case of an RLS-TD critic. The simulation results demonstrate that the ℓ2-regularization term in the critic of CIPG is undamped during the learning process, and that CIPG has better learning efficiency and a faster convergence rate than conventional AC learning control methods.
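
    The abstract describes an actor-critic loop in which, during each iteration, the policy parameters are frozen, an ℓ2-regularized recursive least-squares TD critic evaluates the resulting fixed policy on-policy, and only then is a policy-gradient ascent step taken. The sketch below illustrates that critic-iteration structure under several assumptions that are not from the paper: linear one-hot state features, a softmax actor with toy state-action features, a generic env.reset()/env.step() interface, and arbitrarily chosen constants. It is a minimal illustration of the idea, not the authors' implementation.

    import numpy as np

    GAMMA = 0.95   # discount factor (assumed for this sketch)
    BETA = 1.0     # l2-regularization strength of the RLS-TD critic (assumed)
    ALPHA = 0.05   # actor step size (assumed)


    def features(state, n_feat):
        # Placeholder one-hot feature map phi(s); a real task would use its own basis.
        phi = np.zeros(n_feat)
        phi[int(state) % n_feat] = 1.0
        return phi


    class RLSTDCritic:
        # Recursive least-squares TD(0) with l2 regularization.
        # P tracks (A + BETA*I)^{-1} via a Sherman-Morrison update, where
        # A = sum_t phi_t (phi_t - GAMMA*phi_{t+1})^T and b = sum_t r_t phi_t,
        # so w = P @ b estimates the value function of the fixed policy
        # being evaluated in the current iteration.
        def __init__(self, n_feat):
            self.P = np.eye(n_feat) / BETA
            self.b = np.zeros(n_feat)

        def update(self, phi, reward, phi_next):
            u = phi - GAMMA * phi_next            # TD feature difference
            Pphi = self.P @ phi
            uP = u @ self.P
            self.P -= np.outer(Pphi, uP) / (1.0 + u @ Pphi)
            self.b += reward * phi

        def value_weights(self):
            return self.P @ self.b


    def softmax_probs(theta, phi_sa):
        # phi_sa: (n_actions, n_feat) state-action features for the current state.
        prefs = phi_sa @ theta
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()


    def cipg_iteration(env, theta, n_feat, n_actions, episodes=10):
        # One critic-iteration PG step: hold theta fixed, evaluate the resulting
        # policy with the regularized RLS-TD critic, then take one policy-gradient
        # ascent step using TD errors from that critic as advantage estimates.
        critic = RLSTDCritic(n_feat)
        transitions = []
        for _ in range(episodes):                 # critic iteration: theta stays fixed
            s, done = env.reset(), False
            while not done:
                phi_s = features(s, n_feat)
                phi_sa = np.stack([np.roll(phi_s, a) for a in range(n_actions)])  # toy features
                probs = softmax_probs(theta, phi_sa)
                a = np.random.choice(n_actions, p=probs)
                s_next, r, done = env.step(a)     # assumed environment API
                phi_next = np.zeros(n_feat) if done else features(s_next, n_feat)
                critic.update(phi_s, r, phi_next)
                transitions.append((phi_s, phi_sa, a, r, phi_next, probs))
                s = s_next
        w = critic.value_weights()
        grad = np.zeros_like(theta)
        for phi_s, phi_sa, a, r, phi_next, probs in transitions:
            td_error = r + GAMMA * (phi_next @ w) - phi_s @ w   # advantage estimate
            grad += td_error * (phi_sa[a] - probs @ phi_sa)     # softmax score function
        return theta + ALPHA * grad / len(transitions)

    Separating the critic iteration from the actor update in this way is what keeps the evaluated policy fixed, which is the property the abstract identifies as the basis for the convergence analysis.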