Paper Title
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Authors
Abstract
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.