Authors: Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and they struggle to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. First, a simple scheme is used to augment the text embeddings, which prevents overfitting to the small number of classes seen during training while also saving memory and computation. Second, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourage vision-text feature alignment and guarantee it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves a new state of the art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
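The abstract does not spell out the form of the trainable gated shortcuts, so the following is only an illustrative sketch of one way such a gate could work, not the authors' implementation. It assumes a residual branch scaled by `tanh(alpha)` with `alpha` initialised to zero, so the layer is an identity map at the start of training and the pretrained features (and hence their alignment with text) pass through unchanged. The class name `GatedShortcut`, the `tanh` gate, and the single linear-plus-ReLU branch are all hypothetical choices made for illustration.

```python
import numpy as np


class GatedShortcut:
    """Hypothetical gated shortcut: y = x + tanh(alpha) * f(x).

    alpha starts at 0, so the layer is the identity at initialisation
    and the incoming (pretrained) features are left untouched.
    """

    def __init__(self, dim, rng):
        # Weights of the newly added branch f (an assumed linear + ReLU).
        self.weight = rng.normal(scale=0.02, size=(dim, dim))
        # Trainable scalar gate, zero-initialised (an assumption).
        self.alpha = 0.0

    def __call__(self, x):
        update = np.maximum(x @ self.weight, 0.0)  # f(x)
        return x + np.tanh(self.alpha) * update


rng = np.random.default_rng(0)
layer = GatedShortcut(dim=8, rng=rng)
features = rng.normal(size=(4, 8))  # stand-in for pretrained visual features

out_at_init = layer(features)   # gate closed: output equals the input
layer.alpha = 0.5               # pretend training has opened the gate
out_after = layer(features)     # the new branch now contributes
```

With the gate closed the output matches the input exactly, which is the sense in which alignment can be "guaranteed" at the start of detection training; as `alpha` moves away from zero, the new branch gradually contributes.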
Paper link: http://arxiv.org/pdf/2303.13518v1