Paper Title

Off-Policy Adversarial Inverse Reinforcement Learning

Paper Authors

Arnob, Samin Yeasar

Paper Abstract

Adversarial Imitation Learning (AIL) is a class of algorithms in reinforcement learning (RL) that tries to imitate an expert without taking any reward from the environment and without providing expert behavior directly to the policy training. Rather, the agent learns a policy distribution that minimizes its difference from the expert behavior in an adversarial setting. Adversarial Inverse Reinforcement Learning (AIRL) leverages the idea of AIL, integrates a reward function approximation along with learning the policy, and shows the utility of IRL in the transfer learning setting. However, the reward function approximator that enables transfer learning does not perform well in imitation tasks. We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm that is sample efficient and gives good imitation performance compared to the state-of-the-art AIL algorithms on continuous control tasks. For the same reward function approximator, we show the utility of our algorithm over AIL by using the learned reward function to retrain the policy on a task with significant variation, where expert demonstrations are absent.
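
To make the abstract's description concrete, below is a minimal sketch of the AIRL-style discriminator that the reward function approximation refers to, following the standard AIRL formulation f(s, a, s') = g(s) + gamma * h(s') - h(s) with a state-only reward term g (the approximator that enables transfer). The class name, network sizes, the PyTorch framework, and the pairing with an off-policy learner such as SAC are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch (not the authors' code) of an AIRL-style discriminator as described
# in the abstract:
#   f(s, a, s') = g(s) + gamma * h(s') - h(s)
#   D(s, a, s') = exp(f) / (exp(f) + pi(a|s))
# The recovered reward log D - log(1 - D) = f - log pi(a|s) is what the policy
# (assumed here to be an off-policy learner such as SAC) maximizes.
import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    def __init__(self, state_dim, gamma=0.99, hidden=64):
        super().__init__()
        # g: state-only reward approximator (the term that enables transfer)
        self.g = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        # h: potential-based shaping term
        self.h = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        self.gamma = gamma

    def f(self, s, s_next):
        return self.g(s) + self.gamma * self.h(s_next) - self.h(s)

    def forward(self, s, s_next, log_pi):
        # Discriminator in logit form: log D - log(1 - D) = f - log pi(a|s)
        return self.f(s, s_next) - log_pi

    def reward(self, s, s_next, log_pi):
        # Learned reward used to train (or, at transfer time, retrain) the
        # policy when expert demonstrations are no longer available
        with torch.no_grad():
            return self.forward(s, s_next, log_pi)
```

The logit returned by `forward` can be trained with a binary cross-entropy loss (expert transitions labeled 1, policy transitions labeled 0), and `reward` then supplies the imitation signal that the off-policy learner maximizes.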
