Image–Text Matching Model Based on CLIP Bimodal Encoding

Bibliographic Details
Published in: Applied Sciences, 2024-11, Vol. 14 (22), p. 10384
Authors: Zhu, Yihuan; Xu, Honghua; Du, Ailin; Wang, Bin
Format: Article
Language: English
Online access: Full text
Description
Abstract: Image–text matching is a fundamental task in the multimodal research field, connecting computer vision and natural language processing by aligning visual content with corresponding textual descriptions. Accurate matching is critical for applications such as image captioning and text-based image retrieval, yet it remains challenging due to the differences between data modalities. This paper addresses these challenges by proposing a robust image–text matching model inspired by Contrastive Language–Image Pre-training (CLIP). Our approach employs the Vision Transformer (ViT) as the image encoder and Bidirectional Encoder Representations from Transformers (BERT) as the text encoder, projecting both into a shared vector space to measure semantic similarity. We improve the model's training efficiency using the LiT-tuning paradigm and optimize learning with a cosine decay schedule that dynamically adjusts the learning rate. We validate our method on two benchmark datasets, WuKong and Flickr30k, demonstrating that our model achieves superior performance and significantly improves key evaluation metrics. The results underscore the model's effectiveness in achieving accurate and robust image–text alignment.
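
The following is a minimal, illustrative sketch (not the authors' released code) of the dual-encoder setup described in the abstract: a ViT image encoder and a BERT text encoder projected into a shared embedding space, trained with a CLIP-style contrastive loss, with the image tower frozen as in LiT-tuning and a cosine decay learning-rate schedule. The checkpoint names, embedding dimension, and PyTorch/Hugging Face implementation choices are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, BertModel

class DualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Illustrative checkpoints; the paper's exact backbones and weights may differ.
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # LiT-tuning: lock the image tower; only the text tower and projections are trained.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]  # [CLS] token
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        # Project both modalities into the shared space and L2-normalize.
        img = F.normalize(self.image_proj(img), dim=-1)
        txt = F.normalize(self.text_proj(txt), dim=-1)
        return img, txt

def contrastive_loss(img_emb, txt_emb, logit_scale):
    # Symmetric InfoNCE over the in-batch image-text similarity matrix (CLIP-style).
    logits = logit_scale.exp() * img_emb @ txt_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = DualEncoder()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# Cosine decay of the learning rate over training steps, as mentioned in the abstract.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)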
ISSN: 2076-3417
DOI: 10.3390/app142210384