Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders



Bibliographic details
Published in: Pattern Recognition, 2025-03, Vol. 159, p. 111099, Article 111099
Main authors: Dai, Zuozhuo, Cheng, Kaihui, Shao, Fangtao, Dong, Zilong, Zhu, Siyu
Format: Article
Language: English
Online access: Full text
Description
Abstract: State-of-the-art methods for text–video retrieval generally leverage CLIP embeddings and cosine similarity for efficient retrieval. Meanwhile, recent advances in cross-attention techniques introduce transformer decoders to compute attention between text queries and visual tokens extracted from video frames, enabling a more comprehensive interaction between textual and visual information. In this study, we combine the advantages of both approaches and propose a fine-grained re-ranking approach built on a multi-grained text–video cross-attention module. Specifically, the re-ranker refines the top-K candidates identified by the cosine-similarity network. To explore video–text interactions efficiently, we introduce frame and video token selectors that retain the salient visual tokens at both the frame and video levels. A multi-grained cross-attention mechanism is then applied between text and visual tokens at these two levels to capture multimodal information. To reduce the training overhead of the multi-grained cross-attention module, we freeze the vision backbone and train only the cross-attention module. This frozen strategy scales to larger pre-trained vision models such as ViT-G, leading to improved retrieval performance. Experimental evaluations on text–video retrieval datasets demonstrate the effectiveness and scalability of the proposed re-ranker combined with existing state-of-the-art methods.

Highlights:
• Introduction of CrossTVR, a multi-grained cross-attention-based text–video retrieval approach that efficiently enhances interactions between the textual and visual modalities.
• A frozen strategy for pre-trained vision models that makes training on large-scale video datasets tractable, circumventing the computational cost of end-to-end fine-tuning.
• State-of-the-art performance across various text–video retrieval benchmarks, with scalability to larger vision foundation models and compatibility with the evolving vision–language landscape.
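The abstract outlines a two-stage design: a fast cosine-similarity search over pooled CLIP embeddings, followed by a cross-attention re-ranker that operates on selected visual tokens while the vision backbone stays frozen. Below is a minimal PyTorch sketch of that pipeline; the module names, token counts, and scoring heads are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the retrieve-then-re-rank pipeline described in the abstract.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    """Keep the k most salient visual tokens via a learned scoring head (assumed design)."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score(tokens).squeeze(-1)                  # (batch, num_tokens)
        idx = scores.topk(self.k, dim=-1).indices                # (batch, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (batch, k, dim)
        return torch.gather(tokens, 1, idx)


class MultiGrainedCrossAttentionReranker(nn.Module):
    """Text tokens attend to selected frame-level, then video-level, visual tokens."""

    def __init__(self, dim: int = 512, heads: int = 8, k_frame: int = 16, k_video: int = 32):
        super().__init__()
        self.frame_selector = TokenSelector(dim, k_frame)
        self.video_selector = TokenSelector(dim, k_video)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, text_tokens, frame_tokens, video_tokens):
        # text_tokens: (B, T, D); frame_tokens: (B, F*P, D); video_tokens: (B, N, D)
        f = self.frame_selector(frame_tokens)
        v = self.video_selector(video_tokens)
        x, _ = self.frame_attn(text_tokens, f, f)          # frame-level cross attention
        x, _ = self.video_attn(x, v, v)                    # video-level cross attention
        return self.score_head(x.mean(dim=1)).squeeze(-1)  # one matching score per pair


def freeze(backbone: nn.Module) -> nn.Module:
    """Freeze the vision backbone so only the cross-attention re-ranker is trained."""
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone.eval()


@torch.no_grad()
def first_stage_topk(text_emb: torch.Tensor, video_embs: torch.Tensor, k: int = 10):
    """Stage 1: cosine similarity over pooled embeddings picks the top-k candidates."""
    # text_emb: (D,); video_embs: (num_videos, D)
    sims = F.cosine_similarity(text_emb.unsqueeze(0), video_embs, dim=-1)
    return sims.topk(k).indices
```

At inference, the indices returned by `first_stage_topk` select which candidate videos' cached token features are fed to the re-ranker, so the expensive cross-attention runs on only K pairs per query rather than the full gallery.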
ISSN:0031-3203
DOI:10.1016/j.patcog.2024.111099