Paper Title
Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels
Paper Authors
Paper Abstract
Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to standard convolutional neural network (CNN) architectures? To answer this question, we evaluate ViT training methods for image-based reinforcement learning (RL) control tasks and compare these results to a leading convolutional-network architecture method, RAD. For training the ViT encoder, we consider several recently proposed self-supervised losses that are treated as auxiliary tasks, as well as a baseline with no additional loss terms. We find that the CNN architectures trained using RAD still generally provide superior performance. For the ViT methods, all three types of auxiliary tasks that we consider provide a benefit over plain ViT training. Furthermore, ViT reconstruction-based tasks are found to significantly outperform ViT contrastive learning.