Paper Title

Policy Evaluation Networks

Authors

Harb, Jean, Schaul, Tom, Precup, Doina, Bacon, Pierre-Luc

Abstract


Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (learned Policy Evaluation Network, policy fingerprints, gradient ascent) can produce policies that outperform those that generated the training data, in zero-shot manner.
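The abstract's three ingredients (a policy fingerprint, a learned Policy Evaluation Network, and gradient ascent in policy space) can be illustrated with a toy sketch. This is not the paper's implementation; all names (`probe_states`, `fingerprint`, `pvn_w`) and the linear value model are illustrative assumptions, and the "environment" is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Fingerprint: summarize a policy by its action probabilities
#    on a fixed set of probing states (here, random feature vectors).
n_probes, d = 5, 3
probe_states = rng.normal(size=(n_probes, d))

def fingerprint(theta):
    # theta parameterizes a logistic policy; the fingerprint is
    # P(a=1 | s) evaluated at each probing state.
    return 1.0 / (1.0 + np.exp(-probe_states @ theta))

# 2) Policy Evaluation Network: here just a linear model mapping
#    fingerprints to (synthetic) returns, fit on sampled policies.
true_w = rng.normal(size=n_probes)          # hidden "environment"
thetas = rng.normal(size=(200, d))          # training policies
X = np.array([fingerprint(t) for t in thetas])
y = X @ true_w                              # their (synthetic) returns
pvn_w, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3) Gradient ascent in policy space through the PVN, using no new data:
#    d(value)/d(theta) = probe_states^T [ pvn_w * f * (1 - f) ].
theta = np.zeros(d)
for _ in range(500):
    f = fingerprint(theta)
    grad_f = pvn_w * f * (1 - f)            # chain rule through the sigmoid
    theta += 0.5 * probe_states.T @ grad_f

print(fingerprint(theta) @ pvn_w)           # predicted value of improved policy
```

The key property the sketch shows is that, once the value predictor is trained, policy improvement is pure differentiation: no environment interaction is needed, matching the abstract's zero-shot claim.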
