Paper Title
Policy Gradient from Demonstration and Curiosity
Paper Authors
Paper Abstract
With reinforcement learning, an agent can learn complex behaviors from high-level abstractions of the task. However, exploration and reward shaping remain challenging for existing methods, especially in scenarios where the extrinsic feedback is sparse. Expert demonstrations have been investigated to address these difficulties, but a tremendous number of high-quality demonstrations is usually required. In this work, an integrated policy gradient algorithm is proposed to boost exploration and facilitate intrinsic reward learning from only a limited number of demonstrations. We achieve this by reformulating the original reward function with two additional terms, where the first term measures the Jensen-Shannon divergence between the current policy and the expert, and the second term estimates the agent's uncertainty about the environment. The presented algorithm is evaluated on a range of simulated tasks with sparse extrinsic reward signals, where only a single demonstrated trajectory is provided for each task; superior exploration efficiency and high average return are demonstrated in all tasks. Furthermore, the agent is found to imitate the expert's behavior while sustaining a high return.
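A minimal reading of this reward reformulation (the per-term reward forms and the weights λ1 and λ2 below are assumptions for illustration; the abstract does not specify them) is a per-step shaped reward of the form

    r_total(s_t, a_t) = r_ext(s_t, a_t) + λ1 · r_demo(s_t, a_t) + λ2 · r_curiosity(s_t, a_t)

where r_ext is the sparse extrinsic reward, r_demo is an imitation bonus whose expectation under the policy decreases with the Jensen-Shannon divergence between the current policy and the expert, and r_curiosity is an exploration bonus that grows with the agent's uncertainty about the environment.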