Paper Title

Visually Grounded Commonsense Knowledge Acquisition

Authors

Yuan Yao, Tianyu Yu, Ao Zhang, Mengdi Li, Ruobing Xie, Cornelius Weber, Zhiyuan Liu, Hai-Tao Zheng, Stefan Wermter, Tat-Seng Chua, Maosong Sun

Abstract

Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show a strong correlation with human judgment, with a 0.78 Spearman coefficient. Moreover, the extracted commonsense can be grounded into images with reasonable interpretability. The data and code are available at https://github.com/thunlp/CLEVER.
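
The bag-level formulation described above resembles selective attention from distantly supervised relation extraction: per-image features for an entity pair are weighted by relation-aware attention and summarized into relation scores. Below is a minimal PyTorch sketch of that general idea, not the paper's actual contrastive attention mechanism; all names (`BagAttentionAggregator`, `relation_queries`, `feat_dim`) are illustrative assumptions, not identifiers from the CLEVER codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BagAttentionAggregator(nn.Module):
    """Relation-aware attention over a bag of per-image features for one
    entity pair, producing one logit per commonsense relation (sketch)."""

    def __init__(self, feat_dim: int, num_relations: int):
        super().__init__()
        # One learnable query vector per relation (hypothetical design choice).
        self.relation_queries = nn.Parameter(torch.randn(num_relations, feat_dim))

    def forward(self, bag_feats: torch.Tensor) -> torch.Tensor:
        # bag_feats: (num_images, feat_dim), e.g. features of the entity pair
        # extracted from each image by a vision-language pre-training model.
        scores = self.relation_queries @ bag_feats.t()        # (R, N)
        attn = F.softmax(scores, dim=-1)                      # attention over the bag
        summaries = attn @ bag_feats                          # (R, feat_dim)
        # Score each relation against its own attended bag summary.
        logits = (summaries * self.relation_queries).sum(-1)  # (R,)
        return logits


# Toy usage: a bag of 12 images for one entity pair, 20 candidate relations.
agg = BagAttentionAggregator(feat_dim=768, num_relations=20)
bag = torch.randn(12, 768)
print(agg(bag).shape)  # torch.Size([20])
```

Under the distant-supervision setup the abstract describes, such a model would presumably be trained with bag-level relation labels derived from an existing knowledge base (e.g., via a multi-label loss over the relation logits), so that no per-image annotation is needed.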
