Paper Title
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Paper Authors
Paper Abstract
Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying "red cube" by reasoning over the constituents "red" and "cube". In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating "cube behind sphere" from "sphere behind cube"). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets - single-object, two-object, and relational - designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
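To make the probing setup concrete, the sketch below shows how a structure-sensitive caption pair such as "cube behind sphere" versus "sphere behind cube" can be scored against a single image with CLIP in a zero-shot fashion. It is only a minimal illustration, assuming the Hugging Face transformers CLIP interface and the openai/clip-vit-base-patch32 checkpoint; the image path and captions are placeholders, not the paper's actual benchmark data or evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (illustrative choice, not necessarily the one used in the paper).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical rendered scene containing a cube behind a sphere.
image = Image.open("scene.png")

# A structure-sensitive caption pair: same words, different bindings.
captions = ["a cube behind a sphere", "a sphere behind a cube"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_captions); softmax over captions gives
# CLIP's preference between the two order-swapped descriptions of this image.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

A model that binds concepts in a structure-sensitive way should assign clearly higher probability to the caption whose argument order matches the scene; near-uniform scores on pairs like this are the kind of failure the abstract reports for the two-object and relational settings.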