Paper title
Linear convergence of a policy gradient method for some finite horizon continuous time control problems
Paper authors
Paper abstract
Despite its popularity in the reinforcement learning community, a provably convergent policy gradient method for continuous space-time control problems with nonlinear state dynamics has been elusive. This paper proposes proximal gradient algorithms for feedback controls of finite-time horizon stochastic control problems. The state dynamics are nonlinear diffusions with control-affine drift, and the cost functions are nonconvex in the state and nonsmooth in the control. The system noise can be degenerate, which allows for deterministic control problems as special cases. We prove under suitable conditions that the algorithm converges linearly to a stationary point of the control problem, and is stable with respect to policy updates by approximate gradient steps. The convergence result justifies the recent reinforcement learning heuristics that adding entropy regularization or a fictitious discount factor to the optimization objective accelerates the convergence of policy gradient methods. The proof exploits careful regularity estimates of backward stochastic differential equations.
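To make the setting concrete, the following is a minimal illustrative formulation of the class of problems the abstract describes; the notation ($b$, $B$, $\sigma$, $f$, $g$, $h$, $\phi$, $\tau$) is chosen here for exposition and is not necessarily the paper's. The state follows a diffusion whose drift is affine in the feedback control $\phi$, and the cost splits into a possibly nonconvex state part and a possibly nonsmooth control part:

\begin{align*}
dX_t &= \bigl(b(t,X_t) + B(t,X_t)\,\phi(t,X_t)\bigr)\,dt + \sigma(t,X_t)\,dW_t, \qquad X_0 = x_0,\\
J(\phi) &= \mathbb{E}\Bigl[\int_0^T \bigl(f(t,X_t) + g\bigl(\phi(t,X_t)\bigr)\bigr)\,dt + h(X_T)\Bigr],
\end{align*}

where $f$ and $h$ may be nonconvex in the state, $g$ may be nonsmooth in the control (for instance an $\ell^1$ penalty or an entropy term), and $\sigma$ may be degenerate, covering deterministic dynamics. A proximal gradient update of the feedback control with step size $\tau$ then takes the schematic form

\begin{equation*}
\phi^{k+1} = \operatorname{prox}_{\tau g}\bigl(\phi^{k} - \tau\,\nabla_{\phi}\tilde J(\phi^{k})\bigr),
\end{equation*}

where $\tilde J$ denotes the smooth part of the objective and its gradient is represented through adjoint processes solving an associated backward stochastic differential equation. In this schematic picture, adding entropy regularization or a fictitious discount factor modifies $g$ or reweights the running cost, which is consistent with the acceleration heuristics mentioned in the abstract.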