Authors: Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are plug-and-play: both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR.
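To make the recurrent-regulation idea in the abstract concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' released implementation (see the GitHub repository above). The class name RecurrentRegulatedMatcher, the module shapes, and the exact feedback rules for the attention factor and aggregation weights are all assumptions made for illustration only.

```python
# Hypothetical sketch of the recurrent-regulation idea: a cross-modal attention
# unit lets words attend over image regions, an RCR-like regulator feeds the
# attention message back to adapt the attention (temperature) factor, and an
# RAR-like regulator re-weights word-level alignments before aggregation.
# This is NOT the paper's implementation; names and update rules are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentRegulatedMatcher(nn.Module):
    def __init__(self, dim=256, steps=3):
        super().__init__()
        self.steps = steps
        # RCR-like regulator: maps the attention message to a new attention factor.
        self.rcr = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        # RAR-like regulator: maps attended features to per-word aggregation weights.
        self.rar = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, regions, words):
        # regions: (n_regions, dim) image-region features
        # words:   (n_words,  dim) word features
        temperature = torch.ones(1)  # initial attention factor
        similarity = words.new_zeros(())
        for _ in range(self.steps):
            # Cross-modal attention: each word attends over all regions.
            attn = F.softmax(words @ regions.t() * temperature, dim=-1)   # (n_words, n_regions)
            attended = attn @ regions                                     # (n_words, dim)
            # Word-level alignment scores (cosine similarity per word).
            align = F.cosine_similarity(words, attended, dim=-1)          # (n_words,)
            # RAR-like step: recurrent re-weighting that emphasizes informative words.
            weights = F.softmax(self.rar(attended).squeeze(-1), dim=-1)   # (n_words,)
            similarity = (weights * align).sum()
            # RCR-like step: feed the message back to adapt the attention factor.
            message = attended.mean(dim=0)                                # (dim,)
            temperature = F.softplus(self.rcr(message))                   # positive scalar
        return similarity


# Toy usage with random region/word features.
matcher = RecurrentRegulatedMatcher(dim=256, steps=3)
score = matcher(torch.randn(36, 256), torch.randn(12, 256))
print(score.item())
```

The key point the sketch tries to capture is that both correspondence (via the attention factor) and aggregation (via the word weights) are refined over several feedback steps rather than computed in a single forward pass.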
Paper link: http://arxiv.org/pdf/2303.13371v1