Paper Title

LISPR: An Options Framework for Policy Reuse with Reinforcement Learning

Paper Authors

Daniel Graves, Jun Jin, Jun Luo

Paper Abstract


We propose a framework for transferring any existing policy from a potentially unknown source MDP to a target MDP. This framework (1) enables reuse in the target domain of any form of source policy, including classical controllers, heuristic policies, or deep neural network-based policies, (2) attains optimality under suitable theoretical conditions, and (3) guarantees improvement over the source policy in the target MDP. These are achieved by packaging the source policy as a black-box option in the target MDP and providing a theoretically grounded way to learn the option's initiation set through general value functions. Our approach facilitates the learning of new policies by (1) maximizing the target MDP reward with the help of the black-box option, and (2) returning the agent to states in the learned initiation set of the black-box option where it is already optimal. We show that these two variants are equivalent in performance under some conditions. Through a series of experiments in simulated environments, we demonstrate that our framework performs excellently in sparse reward problems given (sub-)optimal source policies and improves upon prior art in transfer methods such as continual learning and progressive networks, which lack our framework's desirable theoretical properties.
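
To make the mechanism described in the abstract concrete, here is a minimal sketch of wrapping an arbitrary source policy as a black-box option whose initiation set is estimated by a general value function (GVF) predicting whether the source policy succeeds from a given state. The class name `BlackBoxOption`, the tabular GVF, the 0/1 success cumulant, the learning rate, and the 0.5 initiation threshold are all illustrative assumptions for a toy discrete MDP, not the paper's actual implementation.

```python
import numpy as np

class BlackBoxOption:
    """Wraps an arbitrary source policy (classical controller, heuristic,
    neural network, ...) as an option for the target MDP.

    The initiation set is estimated by a general value function (GVF) that
    predicts, per state, whether executing the source policy from that state
    succeeds. Everything here is an illustrative sketch, not the authors' code.
    """

    def __init__(self, source_policy, n_states, threshold=0.5, lr=0.1, gamma=0.99):
        self.source_policy = source_policy        # any callable: state -> action
        self.initiation_gvf = np.zeros(n_states)  # learned success predictions per state
        self.threshold = threshold                # in the initiation set if GVF >= threshold
        self.lr = lr
        self.gamma = gamma

    def can_initiate(self, state):
        """True if the state is in the current estimate of the initiation set."""
        return self.initiation_gvf[state] >= self.threshold

    def act(self, state):
        """Execute the black-box source policy unchanged."""
        return self.source_policy(state)

    def update_gvf(self, state, cumulant, next_state, done):
        """One-step TD update of the initiation GVF from experience gathered
        while the source policy runs. `cumulant` is 1.0 on transitions where
        the source policy reaches its goal and 0.0 otherwise (an assumed,
        simplified success signal)."""
        target = cumulant if done else cumulant + self.gamma * self.initiation_gvf[next_state]
        self.initiation_gvf[state] += self.lr * (target - self.initiation_gvf[state])


# Illustrative usage on a hypothetical 10-state corridor where the source
# policy simply moves right:
source_policy = lambda s: +1
option = BlackBoxOption(source_policy, n_states=10)
for _ in range(20):  # repeated observed successes of the source policy from state 7
    option.update_gvf(state=7, cumulant=1.0, next_state=8, done=True)
print(option.can_initiate(7))  # True: the GVF at state 7 has risen above the threshold
print(option.can_initiate(2))  # False: no evidence yet that the source policy succeeds here
```

A target-task learner could then treat option.act as one additional, temporally extended action available wherever option.can_initiate(state) holds, which corresponds to the first variant in the abstract; the second variant instead rewards the new policy for returning the agent to states where this GVF is high.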
