Authors: Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu
The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective (like CLIP), an image-to-text generative objective (like PaLI), or a text-to-image generative objective (like Parti). However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other, as contrastive learning provides global alignment capacity while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
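To make the unified-objective idea concrete, below is a minimal PyTorch sketch, not the authors' released code, of the three ingredients the abstract describes: a unicoder that switches between bidirectional encoding and causal decoding, a cross-modal decoder, and a joint loss combining the contrastive, image-to-text, and text-to-image objectives. All dimensions, the fixed 0.07 temperature, the equal loss weighting, and the use of two decoder instances (where the paper describes a single shared cross-modal decoder) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Unicoder(nn.Module):
    """Transformer stack that switches between bidirectional encoding and
    causal decoding, so one set of weights serves both roles."""

    def __init__(self, vocab, dim=256, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, ids, causal=False):
        x = self.embed(ids)
        mask = None
        if causal:  # decoding mode: each position attends only to the past
            n = x.size(1)
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool,
                                         device=x.device), diagonal=1)
        return self.blocks(x, mask=mask)


class CrossModalDecoder(nn.Module):
    """Causal decoder that cross-attends to the other modality's features
    and predicts the next token of the target modality."""

    def __init__(self, vocab, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, depth)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tgt, memory):
        n = tgt.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool,
                                     device=tgt.device), diagonal=1)
        return self.head(self.blocks(tgt, memory, tgt_mask=mask))


def cobit_style_losses(img_ids, txt_ids, img_uni, txt_uni, dec_i2t, dec_t2i):
    # 1) Contrastive objective (CLIP-style): both unicoders encode, and
    #    mean-pooled features are aligned across the batch.
    img_feat = F.normalize(img_uni(img_ids).mean(dim=1), dim=-1)
    txt_feat = F.normalize(txt_uni(txt_ids).mean(dim=1), dim=-1)
    logits = img_feat @ txt_feat.t() / 0.07  # fixed temperature for brevity
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

    # 2) Image-to-text generation: the image unicoder encodes, the text
    #    unicoder runs in causal decoding mode, and the cross-modal decoder
    #    predicts each next caption token.
    txt_states = txt_uni(txt_ids[:, :-1], causal=True)
    txt_logits = dec_i2t(txt_states, img_uni(img_ids))
    loss_i2t = F.cross_entropy(txt_logits.reshape(-1, txt_logits.size(-1)),
                               txt_ids[:, 1:].reshape(-1))

    # 3) Text-to-image generation: symmetric, predicting discrete image
    #    tokens (e.g. from a VQ tokenizer, as in Parti-style models).
    img_states = img_uni(img_ids[:, :-1], causal=True)
    img_logits = dec_t2i(img_states, txt_uni(txt_ids))
    loss_t2i = F.cross_entropy(img_logits.reshape(-1, img_logits.size(-1)),
                               img_ids[:, 1:].reshape(-1))

    return loss_con + loss_i2t + loss_t2i
```

A quick smoke test of the sketch, with toy vocabulary sizes (1024 VQ image codes, a 32k text vocabulary) chosen purely for illustration:

```python
img_uni, txt_uni = Unicoder(vocab=1024), Unicoder(vocab=32000)
dec_i2t, dec_t2i = CrossModalDecoder(vocab=32000), CrossModalDecoder(vocab=1024)
img_ids = torch.randint(0, 1024, (2, 16))   # 2 images as 16 VQ tokens each
txt_ids = torch.randint(0, 32000, (2, 12))  # 2 captions of 12 tokens
cobit_style_losses(img_ids, txt_ids, img_uni, txt_uni, dec_i2t, dec_t2i).backward()
```

The point of the switching `causal` flag is that the same unicoder weights are reused whether the modality is being understood (encoding, for contrast and as cross-attention memory) or generated (decoding), which is the weight-sharing benefit the abstract attributes to the unicoder design.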
Paper link: http://arxiv.org/pdf/2303.13455v1