Paper Title
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Paper Authors
Paper Abstract
Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
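The abstract's key formulation is that continuous 6-DoF actions are discretized: the gripper's target translation becomes a voxel index in the workspace grid, and rotations become binned angles, so action prediction reduces to classification ("detecting the next best voxel action"). A minimal sketch of this kind of discretization is below; the grid size, rotation bin count, and function name are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def discretize_action(trans, euler_deg, gripper_open,
                      bounds_min, bounds_max,
                      grid_size=100, rot_bins=72):
    """Map a continuous 6-DoF gripper action to discrete indices,
    in the spirit of PerAct's voxelized action space.
    (grid_size=100 and rot_bins=72, i.e. 5-degree bins, are assumptions.)"""
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    # Translation: normalize (x, y, z) into [0, 1) over the workspace,
    # then bin into a voxel index on the grid.
    norm = (np.asarray(trans, dtype=float) - bounds_min) / (bounds_max - bounds_min)
    voxel_idx = np.clip((norm * grid_size).astype(int), 0, grid_size - 1)
    # Rotation: bin each Euler angle (degrees) into rot_bins discrete bins.
    rot_idx = ((np.asarray(euler_deg, dtype=float) % 360.0)
               / (360.0 / rot_bins)).astype(int)
    # Gripper state: binary open/close.
    grip_idx = int(gripper_open)
    return voxel_idx, rot_idx, grip_idx
```

With actions in this form, a network can output one classification head per component (voxel location, three rotation bins, gripper state) rather than regressing continuous poses, which is the structural prior the abstract credits for sample efficiency.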