Paper Title

Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

Paper Authors

Carl Qi, Pieter Abbeel, Aditya Grover

Paper Abstract

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL) and then maximizes this reward function via reinforcement learning (RL). The policies learned via these approaches are, however, very brittle in practice and deteriorate quickly even under small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that utilizes decision-time planning to correct the compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the reward model at decision time, thereby benefiting from the learning signals of both components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.
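
To make the core idea concrete, below is a minimal Python sketch of one way decision-time planning can keep both learned components in the loop: candidate action sequences are sampled from the base imitation policy, rolled out in a dynamics model, and scored with the learned (IRL) reward, so that lookahead corrects the base policy's compounding errors. Everything here (BasePolicy, reward_model, dynamics_model, plan_action, and the toy dynamics) is a hypothetical stand-in invented for illustration; the abstract does not specify IMPLANT's concrete planner, which may differ from this simple shooting-style sketch.

```python
import numpy as np

# Hypothetical stand-ins for the learned components the abstract mentions:
# a base imitation policy, an IRL reward model, and a dynamics model used
# only at decision time for simulated rollouts.
class BasePolicy:
    def sample(self, state, rng):
        # Toy stochastic policy: a stabilizing prior plus Gaussian noise.
        return -0.5 * state + 0.1 * rng.standard_normal(state.shape)

def reward_model(state, action):
    # Toy learned reward: prefers small states and small actions.
    return -float(state @ state + 0.1 * action @ action)

def dynamics_model(state, action):
    # Toy linear dynamics used to simulate one step ahead.
    return 0.9 * state + action

def plan_action(state, policy, horizon=10, n_candidates=64, seed=0):
    """Decision-time planning sketch: sample candidate action sequences
    from the base imitation policy, roll each out in the dynamics model,
    score it with the learned reward, and execute the best first action."""
    rng = np.random.default_rng(seed)
    best_return, best_first = -np.inf, None
    for _ in range(n_candidates):
        s, total, first = state, 0.0, None
        for t in range(horizon):
            a = policy.sample(s, rng)   # imitation policy proposes actions
            if t == 0:
                first = a
            total += reward_model(s, a) # learned reward scores the rollout
            s = dynamics_model(s, a)    # simulate the next state
        if total > best_return:
            best_return, best_first = total, first
    return best_first

# Usage: plan from an initial state and act on the best first action.
state = np.array([1.0, -0.5])
print(plan_action(state, BasePolicy()))
```

Note the design choice this illustrates: the policy supplies the search distribution (fast, amortized behavior) while the reward model evaluates rollouts (slow, deliberate planning), so neither learned signal is discarded at test time.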
