Paper Title
Non-Markovian policies occupancy measures
Paper Authors
Paper Abstract
A central object of study in Reinforcement Learning (RL) is the Markovian policy, in which an agent's actions are chosen from a memoryless probability distribution, conditioned only on its current state. The family of Markovian policies is broad enough to be interesting, yet simple enough to be amenable to analysis. However, RL often involves more complex policies: ensembles of policies, policies over options, policies updated online, etc. Our main contribution is to prove that the occupancy measure of any non-Markovian policy, i.e., the distribution of transition samples collected with it, can be equivalently generated by a Markovian policy. This result allows theorems about the Markovian policy class to be directly extended to its non-Markovian counterpart, greatly simplifying proofs, in particular those involving replay buffers and datasets. We provide various examples of such applications to the field of Reinforcement Learning.
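The following is a minimal formal sketch of the claimed equivalence, stated in standard discounted-MDP notation (initial state distribution \mu_0, discount factor \gamma) that is assumed here rather than taken from the abstract. For any policy \pi, possibly history-dependent, define its normalized state-action occupancy measure as

\rho_\pi(s, a) \;=\; (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^t \, \Pr\!\left(s_t = s,\; a_t = a \;\middle|\; s_0 \sim \mu_0,\; \pi\right).

The result states that there exists a Markovian policy \bar{\pi} with \rho_{\bar{\pi}} = \rho_\pi. One natural candidate, a standard construction offered here for illustration rather than as the paper's exact proof, is the state-conditional normalization

\bar{\pi}(a \mid s) \;=\; \frac{\rho_\pi(s, a)}{\sum_{a'} \rho_\pi(s, a')},

defined wherever the denominator is nonzero.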