Paper Title
Incorporating Visual Semantics into Sentence Representations within a Grounded Space
Paper Authors
Paper Abstract
Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations --- the focus of this paper --- as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
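The two complementary objectives in the abstract can be illustrated with a minimal sketch. This is not the paper's actual formulation: the loss forms below (a centroid-based clustering term for sentences sharing an image, and a squared difference between pairwise cosine similarities across modalities for the preservation term) are illustrative assumptions, and the grounded-space embeddings are taken as given arrays rather than learned projections.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_loss(grounded_sents: np.ndarray) -> float:
    """Objective (1), sketched: sentences describing the same visual
    content should lie close together in the grounded space.
    Here: mean squared distance to the group's centroid (an assumption,
    not the paper's exact loss)."""
    centroid = grounded_sents.mean(axis=0)
    return float(np.mean(np.sum((grounded_sents - centroid) ** 2, axis=1)))

def similarity_preservation_loss(grounded: np.ndarray,
                                 visual: np.ndarray) -> float:
    """Objective (2), sketched: pairwise similarities between related
    elements should be preserved across modalities. Here: mean squared
    gap between cosine similarities in the grounded-text space and in
    the visual space (again an illustrative choice)."""
    n = len(grounded)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (cosine_sim(grounded[i], grounded[j])
                      - cosine_sim(visual[i], visual[j])) ** 2
            pairs += 1
    return total / pairs

# Toy usage: two sentences grounded to the same scene, two visual vectors.
g = np.array([[1.0, 0.1], [0.9, 0.0]])
v = np.array([[0.5, 0.5], [0.4, 0.6]])
print(cluster_loss(g), similarity_preservation_loss(g, v))
```

In a full model both terms would be minimized jointly while learning the projection from the textual space into the grounded space, so that visual structure shapes the sentence representations without forcing a one-to-one text–image embedding.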