Paper Title
Approximating Gradients for Differentiable Quality Diversity in Reinforcement Learning
Paper Authors
Paper Abstract
Consider the problem of training robustly capable agents. One approach is to generate a diverse collection of agent policies. Training can then be viewed as a quality diversity (QD) optimization problem, where we search for a collection of performant policies that are diverse with respect to quantified behavior. Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available. However, agent policies typically assume that the environment is not differentiable. To apply DQD algorithms to training agent policies, we must approximate gradients for performance and behavior. We propose two variants of the current state-of-the-art DQD algorithm that compute gradients via approximation methods common in reinforcement learning (RL). We evaluate our approach on four simulated locomotion tasks. One variant achieves results comparable to the current state-of-the-art in combining QD and RL, while the other performs comparably in two locomotion tasks. These results provide insight into the limitations of current DQD algorithms in domains where gradients must be approximated. Source code is available at https://github.com/icaros-usc/dqd-rl.
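For illustration only, the following is a minimal sketch of one gradient-approximation method common in RL, an evolution-strategies-style estimator for a black-box objective such as episode return. It is not the authors' implementation; the function name `evaluate_return` and the hyperparameters `sigma` and `num_samples` are hypothetical placeholders.

```python
# Minimal sketch (assumption, not the paper's code): estimate the gradient of a
# non-differentiable objective via Gaussian smoothing with antithetic sampling.
import numpy as np

def es_gradient_estimate(theta, evaluate_return, sigma=0.02, num_samples=32, rng=None):
    """Approximate the gradient of evaluate_return at policy parameters theta."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        eps = rng.standard_normal(theta.shape)
        # Antithetic sampling: evaluate both perturbation directions.
        f_plus = evaluate_return(theta + sigma * eps)
        f_minus = evaluate_return(theta - sigma * eps)
        grad += (f_plus - f_minus) * eps
    return grad / (2 * sigma * num_samples)

if __name__ == "__main__":
    # Toy quadratic objective standing in for an episode return.
    toy_return = lambda th: -np.sum(th ** 2)
    theta = np.ones(10)
    print(es_gradient_estimate(theta, toy_return))
```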