Authors: Pitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich, Supasorn Suwajanakorn
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask: is it possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language? We propose a new problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
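To make the pipeline in the abstract concrete, below is a minimal, hedged sketch of the overall idea: cluster DINO patch features into over-segments, encode each segment with CLIP, and assign a text label. It is not the paper's implementation: a fixed candidate vocabulary (`VOCAB`) stands in for the paper's open-ended embedding-to-text translation, and simple masked crops stand in for the attention-masking inside CLIP; the input filename and vocabulary are hypothetical.

```python
# Hedged sketch of a zero-guidance-style pipeline with frozen DINO + CLIP.
# Substitutions vs. the paper: k-means over DINO patch tokens for over-segmentation,
# background-blanked crops instead of CLIP attention masking, and a fixed vocabulary
# instead of free-form label generation.
import numpy as np
import torch
import clip                               # https://github.com/openai/CLIP
from PIL import Image
from sklearn.cluster import KMeans
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two pre-trained generalist models, used frozen (no fine-tuning).
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical candidate labels; the paper instead decodes CLIP embeddings into text.
VOCAB = ["painting", "person", "wall", "floor", "frame", "crowd"]

@torch.no_grad()
def over_segment(image: Image.Image, n_segments: int = 8) -> np.ndarray:
    """Cluster DINO patch features into over-segments (a simple k-means stand-in)."""
    tf = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    x = tf(image).unsqueeze(0).to(device)
    # get_intermediate_layers returns tensors of shape [B, 1 + N_patches, D]; drop CLS.
    feats = dino.get_intermediate_layers(x, n=1)[0][0, 1:]            # (N_patches, D)
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(feats.cpu().numpy())
    side = int(np.sqrt(labels.shape[0]))                              # 14 for 224 / 16
    return labels.reshape(side, side)

@torch.no_grad()
def label_segments(image: Image.Image, seg_map: np.ndarray) -> dict:
    """Encode each over-segment with CLIP and pick the nearest vocabulary word."""
    text_emb = clip_model.encode_text(clip.tokenize(VOCAB).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    img = np.array(image.resize((224, 224)))
    results = {}
    for seg_id in np.unique(seg_map):
        # Upsample the 14x14 patch mask to pixel resolution.
        mask = np.kron((seg_map == seg_id).astype(np.uint8),
                       np.ones((16, 16), dtype=np.uint8)).astype(bool)
        masked = img.copy()
        masked[~mask] = 127                      # blank out pixels outside the segment
        x = clip_preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
        emb = clip_model.encode_image(x)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        sims = (emb @ text_emb.T).squeeze(0)
        results[int(seg_id)] = VOCAB[int(sims.argmax())]
    return results          # segments sharing a label could then be merged

if __name__ == "__main__":
    image = Image.open("museum.jpg").convert("RGB")    # hypothetical input image
    print(label_segments(image, over_segment(image)))
```

The blanked-crop step is where the paper's attention-masking contribution would slot in: instead of zeroing pixels, the paper restricts attention inside CLIP so each segment embedding keeps a balance of global scene context and local segment detail.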
Paper: http://arxiv.org/pdf/2303.13396v1