Paper Title
Policy Gradient using Weak Derivatives for Reinforcement Learning
Paper Authors
Paper Abstract
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and shown to be $O(1/\sqrt{k})$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of the estimates obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment show the superior performance of the proposed algorithm.
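For context, here is a schematic sketch of the two gradient forms contrasted above, written in standard notation that may differ from the paper's own; the discounted state distribution $d^{\pi_\theta}$, the Jordan-type decomposition $(c_\theta, \pi_\theta^{+}, \pi_\theta^{-})$, and the omitted constant and discount factors are assumptions of this sketch, not the paper's exact statement. The classical score-function form of the Policy Gradient Theorem reads
$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right]$.
A weak (measure-valued) derivative instead decomposes the derivative of the policy measure itself as
$\nabla_\theta \pi_\theta(\cdot \mid s) = c_\theta(s)\left(\pi_\theta^{+}(\cdot \mid s) - \pi_\theta^{-}(\cdot \mid s)\right)$,
so that the gradient becomes a difference of two expectations and no score function appears:
$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[c_\theta(s)\left(\mathbb{E}_{a^{+} \sim \pi_\theta^{+}(\cdot \mid s)}\big[Q^{\pi_\theta}(s,a^{+})\big] - \mathbb{E}_{a^{-} \sim \pi_\theta^{-}(\cdot \mid s)}\big[Q^{\pi_\theta}(s,a^{-})\big]\right)\right]$.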