Paper Title
Chain of Thought Imitation with Procedure Cloning
Paper Authors
Paper Abstract
Imitation learning aims to extract high-performance policies from logged demonstrations of expert behavior. It is common to frame imitation learning as a supervised learning problem in which one fits a function approximator to the input-output mapping exhibited by the logged demonstrations (input observations to output actions). While the framing of imitation learning as a supervised input-output learning problem allows for applicability in a wide variety of settings, it is also an overly simplistic view of the problem in situations where the expert demonstrations provide much richer insight into expert behavior. For example, applications such as path navigation, robot manipulation, and strategy games acquire expert demonstrations via planning, search, or some other multi-step algorithm, revealing not just the output action to be imitated but also the procedure for how to determine this action. While these intermediate computations may use tools not available to the agent during inference (e.g., environment simulators), they are nevertheless informative as a way to explain an expert's mapping of state to actions. To properly leverage expert procedure information without relying on the privileged tools the expert may have used to perform the procedure, we propose procedure cloning, which applies supervised sequence prediction to imitate the series of expert computations. This way, procedure cloning learns not only what to do (i.e., the output action), but how and why to do it (i.e., the procedure). Through empirical analysis on navigation, simulated robotic manipulation, and game-playing environments, we show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations, including those configurations for which running the expert's procedure directly is infeasible.
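The abstract's core distinction is in the training targets: behavioral cloning supervises only the expert's output action, while procedure cloning supervises the whole sequence of intermediate computations leading to that action. A minimal sketch of that data-format difference, using a BFS planner as the "expert" on a toy maze (all names and the maze setup here are illustrative, not taken from the paper):

```python
from collections import deque

def bfs_expert(grid, start, goal):
    """Expert planner: BFS over a 4-connected grid (0 = free, 1 = wall).

    Returns the first action on a shortest path to the goal, plus the
    sequence of expanded cells -- the expert's intermediate computation.
    """
    rows, cols = len(grid), len(grid[0])
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    queue = deque([start])
    parent = {start: None}
    expansions = []  # the "procedure": cells in BFS expansion order
    while queue:
        cell = queue.popleft()
        expansions.append(cell)
        if cell == goal:
            break
        for dr, dc in moves:
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    # Walk parents back from the goal to recover the first step from start.
    cell = goal
    while parent[cell] != start:
        cell = parent[cell]
    step = (cell[0] - start[0], cell[1] - start[1])
    return moves[step], expansions

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
action, procedure = bfs_expert(grid, start=(0, 0), goal=(2, 0))

# Behavioral cloning example: observation -> output action only.
bc_example = ((0, 0), action)
# Procedure cloning example: observation -> computation trace, then action,
# to be fit with supervised sequence prediction.
pc_example = ((0, 0), procedure + [action])
```

At inference time the learned policy generates the trace itself rather than calling BFS, which is what lets procedure cloning work even where the expert's search (a privileged tool, per the abstract) is unavailable.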