Since the late 1980s, temporal difference (TD) learning has dominated research on policy evaluation algorithms. However, the need to avoid TD's defects, such as low data efficiency and divergence in off-policy learning, has inspired a large number of novel TD-based approaches. Gradient-based and least-squares-based algorithms comprise the major part of these new approaches. This paper aims to combine the advantages of these two categories to derive an efficient policy evaluation algorithm with O(n²) per-time-step runtime complexity. The least-squares-based framework is adopted, and the gradient correction is used to improve convergence performance. This paper begins with the revision of a previous O(n³) batch algorithm, least-squares TD with a gradient correction (LS-TDC), to regularize the parameter vector. Based on the recursive least-squares technique, an O(n²) counterpart of LS-TDC called RC is proposed. To increase data efficiency, we generalize RC with eligibility traces. An off-policy extension is also proposed based on importance sampling. In addition, convergence analyses for RC as well as LS-TDC are given. The empirical results in both on-policy and off-policy benchmarks show that RC has higher estimation accuracy than RLSTD and significantly lower runtime complexity than LS-TDC.
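For context, the sketch below shows the recursive least-squares TD(0) baseline (RLSTD) referenced above: it maintains the inverse of the accumulated feature matrix via the Sherman-Morrison identity, which is what yields O(n²) cost per step. The paper's RC algorithm additionally incorporates the gradient-correction term, which this sketch omits; all names here are illustrative.

```python
import numpy as np

class RLSTD:
    """Minimal recursive least-squares TD(0) sketch (the RLSTD baseline,
    not the paper's RC algorithm): the inverse matrix P is updated with the
    Sherman-Morrison identity so each step costs O(n^2)."""

    def __init__(self, n_features, gamma=0.99, epsilon=1e-2):
        self.gamma = gamma
        self.P = np.eye(n_features) / epsilon   # running estimate of A^{-1}
        self.b = np.zeros(n_features)           # running estimate of b
        self.theta = np.zeros(n_features)       # value-function weights

    def update(self, phi, reward, phi_next):
        # TD feature difference for the observed transition (s, r, s')
        delta_phi = phi - self.gamma * phi_next
        # Sherman-Morrison rank-one update of P = A^{-1}
        Pphi = self.P @ phi
        denom = 1.0 + delta_phi @ Pphi
        self.P -= np.outer(Pphi, delta_phi @ self.P) / denom
        # accumulate b and re-solve theta = A^{-1} b
        self.b += reward * phi
        self.theta = self.P @ self.b

    def value(self, phi):
        return phi @ self.theta
```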
Actor-critic methods based on the policy gradient (PG-based AC) have been widely studied to solve learning control problems. To increase the data efficiency of learning prediction in the critic of PG-based AC, studies on how to use recursive least-squares temporal difference (RLS-TD) algorithms for policy evaluation have been conducted in recent years. In such contexts, the RLS-TD critic evaluates an unknown mixed policy generated by a series of different actors, rather than the one fixed policy generated by the current actor. Therefore, this AC framework with an RLS-TD critic cannot be proved to converge to the optimal fixed point of the learning problem. To address this problem, this paper proposes a new AC framework named critic-iteration PG (CIPG), which learns the state-value function of the current policy in an on-policy way and performs gradient ascent in the direction that improves the discounted total reward. During each iteration, CIPG keeps the policy parameters fixed and evaluates the resulting fixed policy with an ℓ2-regularized RLS-TD critic. Our convergence analysis extends the previous convergence analysis of PG with function approximation to the case of an RLS-TD critic. The simulation results demonstrate that the ℓ2-regularization term in the critic of CIPG is undamped during the learning process, and that CIPG has better learning efficiency and a faster convergence rate than conventional AC learning control methods.
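The skeleton below illustrates the critic-iteration structure described above: the policy is frozen, the resulting fixed policy is evaluated on-policy, and only then is a policy-gradient step taken. The env, policy, and make_critic interfaces are hypothetical assumptions for this sketch; the paper's exact regularized RLS-TD recursion and gradient estimator are not shown.

```python
def critic_iteration_ac(env, policy, make_critic, n_iterations=50,
                        eval_steps=2000, actor_lr=1e-3):
    """Illustrative critic-iteration actor-critic loop in the spirit of CIPG.
    env, policy, and make_critic are assumed interfaces, not the paper's code."""
    for _ in range(n_iterations):
        # 1) freeze the policy parameters and evaluate the resulting fixed
        #    policy on-policy with a freshly initialized RLS-TD critic
        critic = make_critic()
        state = env.reset()
        for _ in range(eval_steps):
            action = policy.sample(state)
            next_state, reward, done = env.step(action)
            critic.update(state, action, reward, next_state)
            state = env.reset() if done else next_state
        # 2) one policy-gradient ascent step using the frozen critic's values
        policy.params = policy.params + actor_lr * policy.gradient(critic)
    return policy
```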
The authors consider the learning control problem in reinforcement learning (RL) with a continuous action space. Policy gradient, and in particular the deterministic policy gradient (DPG) algorithm, provides a method for solving learning control problems with continuous action spaces. However, when the RL task is complex enough that tuning of the function approximation is necessary, hand-tuning the features is infeasible. To solve this problem, the authors extend the DPG algorithm by adding an approximate-linear-dependency-based sparsification procedure, which enables DPG to automatically select useful and sparse features. As far as the authors know, this is the first time the feature selection problem has been considered in DPG. Simulation results illustrate that (i) the proposed algorithm can find the optimal solution of the continuous version of the mountain car problem; and (ii) the proposed algorithm achieves good performance over a large range of approximate-linear-dependency threshold parameter settings.
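For reference, the sketch below shows the standard kernel-based approximate linear dependency (ALD) test commonly used for this kind of sparsification: a sample joins the feature dictionary only if it cannot be approximated as a linear combination of the current dictionary in feature space. The Gaussian kernel and all names are illustrative assumptions; the paper's integration into DPG may differ in detail.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def build_ald_dictionary(samples, nu=0.1, kernel=gaussian_kernel):
    """Standard ALD sparsification sketch (illustrative, not the paper's exact code).
    nu is the approximate-linear-dependency threshold parameter."""
    dictionary = [samples[0]]
    K_inv = np.array([[1.0 / kernel(samples[0], samples[0])]])
    for x in samples[1:]:
        k = np.array([kernel(d, x) for d in dictionary])
        c = K_inv @ k                    # best reconstruction coefficients
        delta = kernel(x, x) - k @ c     # squared residual of the reconstruction
        if delta > nu:                   # not well represented: grow the dictionary
            dictionary.append(x)
            # Schur-complement block update of the dictionary kernel inverse
            K_inv = np.block([
                [K_inv + np.outer(c, c) / delta, -c[:, None] / delta],
                [-c[None, :] / delta,            np.array([[1.0 / delta]])],
            ])
    return dictionary, K_inv
```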
We consider the tasks of feature selection and policy evaluation based on linear value function approximation in reinforcement learning problems. High-dimensional feature vectors and a limited number of samples can easily cause over-fitting and are computationally expensive. To prevent this problem, ℓ1-regularized methods obtain sparse solutions and thus improve generalization performance. We propose an efficient ℓ1-regularized recursive least-squares-based online algorithm with O(n²) complexity per time step, termed ℓ1-RC. With the help of nested optimization decomposition, ℓ1-RC solves a series of standard optimization problems and avoids minimizing the mean-square projected Bellman error with ℓ1-regularization directly. In ℓ1-RC, we propose RC with iterative refinement to minimize the operator error, and an alternating direction method of multipliers with a proximal operator to minimize the fixed-point error. The convergence of ℓ1-RC is established based on the ordinary differential equation method, and some extensions are also given. In empirical evaluations, some state-of-the-art ℓ1-regularized methods are chosen as baselines, and ℓ1-RC is tested in both policy evaluation and learning control benchmarks. The empirical results show the effectiveness and advantages of ℓ1-RC.
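The building blocks named above, a proximal operator for the ℓ1 term and an alternating direction method of multipliers (ADMM), are standard and are sketched below for a generic lasso-type problem. This is the generic machinery, not the paper's exact ℓ1-RC decomposition of the fixed-point error.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iters=200):
    """Standard ADMM for min_x 0.5*||A x - b||^2 + lam*||x||_1,
    shown as a generic sketch rather than the paper's l1-RC recursion."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                            # scaled dual variable
    Q = np.linalg.inv(A.T @ A + rho * np.eye(n))   # cached for the x-update
    Atb = A.T @ b
    for _ in range(n_iters):
        x = Q @ (Atb + rho * (z - u))          # quadratic subproblem
        z = soft_threshold(x + u, lam / rho)   # l1 proximal step
        u = u + x - z                          # dual update
    return z
```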
An ℓ2-regularized policy evaluation algorithm, termed RRC (Regularized RC), is proposed for reinforcement learning problems. An RBF network is used to construct the value function approximation, and its weight vector is solved based on the RC algorithm. An additional recursive step is used to achieve a different effect from the traditional recursive least-squares-based ℓ2-regularized algorithm: the regularization term does not decrease throughout learning. Additionally, a fast counterpart algorithm with O(n²) complexity, termed Fast RRC (FRRC), is also proposed, which is a more practical online algorithm than RRC. The convergence analysis and experimental results demonstrate the effectiveness of RRC and FRRC.
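For contrast, the conventional batch ℓ2-regularized LSTD baseline is sketched below: the ridge term is fixed while the data matrix keeps growing, so its relative effect decays as samples accumulate, which is the "decreasing regularization" behavior RRC is designed to avoid. This is a generic sketch with illustrative names, not the paper's RRC recursion.

```python
import numpy as np

def ridge_lstd(transitions, n_features, gamma=0.99, lam=1.0):
    """Conventional batch l2-regularized LSTD (a sketch, not RRC):
    theta = (A + lam*I)^{-1} b with A and b summed over observed transitions.
    A grows with the number of samples while lam*I stays fixed, so the
    regularizer's relative influence shrinks throughout learning."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, reward, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += reward * phi
    return np.linalg.solve(A + lam * np.eye(n_features), b)
```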
Least-squares temporal difference learning (LSTD) has been used mainly to improve the data efficiency of the critic in actor-critic (AC) methods. However, convergence analysis of the resulting algorithms is difficult when the policy is changing. In this paper, a new AC method is proposed based on LSTD under the discounted criterion. The method comprises two contributions: (1) LSTD works in an on-policy way to achieve a good convergence property for AC. (2) A sustainable ℓ2-regularization version of recursive LSTD, termed RRLSTD, is proposed to solve the ℓ2-regularization problem of the critic in AC. To reduce the computational complexity of RRLSTD, we propose a fast version termed FRRLSTD. Simulation results show that RRLSTD/FRRLSTD-based AC methods have better learning efficiency and a faster convergence rate than conventional AC methods.
The task of diagnosing epilepsy in medicine by classifying electroencephalograph (EEG) signals is considered. Since an EEG signal has a large number of dimensions as an input sample vector, many previous classification methods have been proposed as hybrid frameworks, which are structurally complex and computationally expensive. In this paper, a convolutional neural network (CNN) is used to realize feature extraction and classification simultaneously. The CNN scheme is adopted to overcome the curse of dimensionality. Meanwhile, the accelerated proximal gradient method is used to speed up training. Experimental results show that the proposed method achieves ideal accuracy in epilepsy diagnosis and converges faster than CNNs trained with traditional gradient back-propagation.
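A minimal 1-D CNN of the kind described above is sketched below in PyTorch. The channel counts, kernel sizes, two-class output, and input length are illustrative assumptions, not the paper's architecture, and the sketch only defines the network; the paper's accelerated proximal gradient training procedure is not shown.

```python
import torch
import torch.nn as nn

class EEG1DCNN(nn.Module):
    """Minimal 1-D CNN sketch for single-channel EEG classification.
    All architectural choices here are illustrative assumptions."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),      # fixed-length summary regardless of input length
        )
        self.classifier = nn.Linear(32 * 8, n_classes)

    def forward(self, x):                 # x: (batch, 1, n_samples)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# example forward pass on a dummy batch of 4 EEG segments of length 178
model = EEG1DCNN()
logits = model(torch.randn(4, 1, 178))
```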