Paper Title
Model-Free Generative Replay for Lifelong Reinforcement Learning: Application to Starcraft-2
Paper Authors
Paper Abstract
One approach to meeting the challenges of deep lifelong reinforcement learning (LRL) is careful management of the agent's learning experiences, in order to learn (without forgetting) and to build internal meta-models (of the tasks, environments, agents, and world). Generative replay (GR) is a biologically inspired replay mechanism that augments learning experiences with self-labelled examples drawn from an internal generative model that is updated over time. We present a version of GR for LRL that satisfies two desiderata: (a) introspective density modelling of the latent representations of policies learned using deep RL, and (b) model-free end-to-end learning. In this paper, we study three deep learning architectures for model-free GR, starting from a naïve GR and adding ingredients to achieve (a) and (b). We evaluate our proposed algorithms on three different scenarios comprising tasks from the Starcraft-2 and Minigrid domains. We report several key findings showing the impact of the design choices on quantitative metrics, including transfer learning, generalization to unseen tasks, fast adaptation after task change, performance relative to task experts, and catastrophic forgetting. We observe that our GR prevents drift in the features-to-action mapping from the latent vector space of a deep RL agent. We also show improvements in established lifelong learning metrics. We find that a small random replay buffer significantly increases the stability of training. Overall, we find that "hidden replay" (a well-known architecture for class-incremental classification) is the most promising approach, pushing the state of the art in GR for LRL, and we observe that the architecture of the sleep model may matter more for performance than the type of replay used. Our experiments required only 6% of training samples to achieve 80-90% of expert performance in most Starcraft-2 scenarios.
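To illustrate the latent ("hidden") generative replay mechanism summarized above, the sketch below shows one consolidation ("sleep") step in PyTorch: a small generative model produces latent vectors that are self-labelled by a frozen copy of the policy head and mixed with real latents from a small random replay buffer. All names (LatentGenerator, policy_head, sleep_step, replay_ratio), network sizes, and shapes are hypothetical; this is a minimal sketch of the general technique, not the authors' implementation.

```python
# Illustrative sketch of hidden (latent) generative replay for lifelong RL.
# Hypothetical names and shapes; not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, N_ACTIONS, NOISE_DIM = 64, 8, 16

class LatentGenerator(nn.Module):
    """Small generative model over the agent's latent vectors (stand-in for a VAE/GAN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))

    def forward(self, n):
        return self.net(torch.randn(n, NOISE_DIM))

policy_head = nn.Linear(LATENT_DIM, N_ACTIONS)       # "sleep" policy trained on latents
old_policy_head = nn.Linear(LATENT_DIM, N_ACTIONS)   # frozen copy used for self-labelling
generator = LatentGenerator()
opt = torch.optim.Adam(policy_head.parameters(), lr=1e-3)

def sleep_step(real_latents, real_actions, replay_ratio=0.5, batch_size=128):
    """One consolidation step: mix real latents from a small random buffer
    with self-labelled latents drawn from the generative model (generative replay)."""
    n_gen = int(batch_size * replay_ratio)
    with torch.no_grad():
        gen_latents = generator(n_gen)                          # replayed pseudo-experiences
        gen_actions = old_policy_head(gen_latents).argmax(-1)   # self-labelling by the old policy
    latents = torch.cat([real_latents[:batch_size - n_gen], gen_latents])
    actions = torch.cat([real_actions[:batch_size - n_gen], gen_actions])
    loss = F.cross_entropy(policy_head(latents), actions)       # distill without forgetting
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example call with random stand-in data from the small "wake" replay buffer:
loss = sleep_step(torch.randn(64, LATENT_DIM), torch.randint(0, N_ACTIONS, (64,)))
```

Because the replayed samples live in the latent space rather than pixel space, the generator only has to model compact feature vectors, which is what makes the hidden-replay variant attractive for high-dimensional domains such as Starcraft-2.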