Paper Title
Offline Visual Representation Learning for Embodied Navigation
Paper Authors
Paper Abstract
How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in the state of the art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
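Below is a minimal sketch of the 2-stage recipe the abstract describes, written in PyTorch. It is an illustration under stated assumptions, not the authors' released code: the SimSiam-style SSL objective, the ResNet-50 backbone, the augmentation set, and the `NavPolicy` class are stand-ins chosen to show the general shape of "offline SSL pretraining of a visual encoder, then online finetuning of that encoder inside a navigation policy".

```python
# Sketch of the OVRL-style 2-stage pipeline (illustrative, not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet50

# --- Stage 1: offline self-supervised pretraining on pre-rendered indoor images ---
encoder = resnet50(weights=None)
encoder.fc = nn.Identity()  # keep the 2048-d pooled feature as the representation

projector = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 256))
predictor = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 256))

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
])

def ssl_loss(frames: torch.Tensor) -> torch.Tensor:
    """SimSiam-style loss on two augmented views of the same pre-rendered frames
    (a stand-in for whatever SSL objective is used in stage 1)."""
    v1, v2 = augment(frames), augment(frames)
    z1, z2 = projector(encoder(v1)), projector(encoder(v2))
    p1, p2 = predictor(z1), predictor(z2)
    # Negative cosine similarity with stop-gradient on the target branch.
    return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2

# --- Stage 2: online finetuning of the pretrained encoder on a navigation task ---
class NavPolicy(nn.Module):
    """Recurrent visuomotor policy whose visual encoder is initialized from stage 1
    and finetuned end-to-end (hypothetical architecture for illustration)."""
    def __init__(self, encoder: nn.Module, num_actions: int = 4):
        super().__init__()
        self.encoder = encoder
        self.rnn = nn.GRU(2048, 512, batch_first=True)
        self.actor = nn.Linear(512, num_actions)
        self.critic = nn.Linear(512, 1)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 3, H, W) observation sequences
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.rnn(feats, hidden)
        return self.actor(out), self.critic(out), hidden

policy = NavPolicy(encoder)  # finetuned on ImageNav/ObjectNav with RL or IL,
                             # keeping image augmentations on the observations
```

In the setup the abstract describes, stage-2 finetuning keeps image augmentations and runs under long learning schedules (up to 2 billion frames of experience for RL), which is the regime where the pretrained representations show increasing rather than diminishing gains.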