Paper Title

Improved Policy Optimization for Online Imitation Learning

Paper Authors

Jonathan Wilder Lavington, Sharan Vaswani, Mark Schmidt

Paper Abstract

We consider online imitation learning (OIL), where the task is to find a policy that imitates the behavior of an expert via active interaction with the environment. We aim to bridge the gap between the theory and practice of policy optimization algorithms for OIL by analyzing one of the most popular OIL algorithms, DAGGER. Specifically, if the class of policies is sufficiently expressive to contain the expert policy, we prove that DAGGER achieves constant regret. Unlike previous bounds that require the losses to be strongly-convex, our result only requires the weaker assumption that the losses be strongly-convex with respect to the policy's sufficient statistics (not its parameterization). In order to ensure convergence for a wider class of policies and losses, we augment DAGGER with an additional regularization term. In particular, we propose a variant of Follow-the-Regularized-Leader (FTRL) and its adaptive variant for OIL and develop a memory-efficient implementation, which matches the memory requirements of FTL. Assuming that the loss functions are smooth and convex with respect to the parameters of the policy, we also prove that FTRL achieves constant regret for any sufficiently expressive policy class, while retaining $O(\sqrt{T})$ regret in the worst-case. We demonstrate the effectiveness of these algorithms with experiments on synthetic and high-dimensional control tasks.
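
As a rough illustration of the distinction the abstract draws, the updates underlying DAGGER (which aggregates per-round imitation losses, i.e., a Follow-the-Leader scheme) and the proposed regularized variant can be sketched as below. This is a minimal sketch in standard online-learning notation; the symbols $\Pi$ (policy class), $\ell_i$ (per-round imitation loss), and $R_t$ (regularizer) are illustrative and not taken from the paper.

$$\text{FTL (DAGGER):}\quad \pi_{t+1} = \arg\min_{\pi \in \Pi} \sum_{i=1}^{t} \ell_i(\pi) \qquad\qquad \text{FTRL:}\quad \pi_{t+1} = \arg\min_{\pi \in \Pi} \left[\sum_{i=1}^{t} \ell_i(\pi) + R_t(\pi)\right]$$

In these terms, the abstract's constant-regret result for DAGGER concerns the unregularized update when $\Pi$ is expressive enough to contain the expert, while the added regularizer $R_t$ is what extends the guarantees to broader policy and loss classes.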
