Bias Mitigation and Representation Optimization for Noise-Robust Cross-modal Retrieval

Bibliographic Details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-10
Main Authors: Liu, Yu; Chen, Haipeng; Qin, Guihe; Song, Jincai; Yang, Xun
Format: Article
Language: English
Online Access: Full Text
Description
Summary: The remarkable progress in cross-modal retrieval relies on accurately annotated multimedia datasets. In practice, most datasets used for training cross-modal retrieval models are collected automatically from the Internet to reduce data collection costs. However, they inevitably contain mismatched pairs, i.e., noisy correspondences, which degrade model performance. Recent advances use the predicted similarity distribution of individual samples for noise validation and correction, an approach that faces two challenging dilemmas: 1) confirmation bias and 2) unstable performance as noise increases. In light of the above, we propose a generalized Bias Mitigation and Representation Optimization framework (BMRO). Specifically, we propose a Bias Estimator (BE) that estimates the unbiased confidence factor of a sample by contrasting it against its nearest neighbors. This unbiased confidence factor precisely adjusts each sample's contribution and enables accurate sample division, which in turn allows the Adaptive Representation Optimizer (ARO) to provide tailored optimization strategies for clean and noisy samples. ARO performs contrastive learning between clean samples and generated hard samples, promoting the generalizability and robustness of the learned representation. In addition, it uses complementary learning to reduce incorrect guidance from noisy samples. Extensive experiments on five visual-text benchmarks verify that BMRO significantly improves matching accuracy and performance stability against noisy correspondences.
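
The record does not include the paper's formulas, so the Python sketch below is only an illustrative simplification of two ideas named in the summary: estimating a per-pair confidence factor by contrasting a pair's own similarity against its nearest neighbors, and using that factor to divide training samples into clean and noisy sets. The function names, the sigmoid-style confidence rule, and the 0.5 threshold are hypothetical assumptions, not BMRO's actual Bias Estimator.

# Illustrative sketch only; the neighbor-contrast rule below is an assumed
# simplification, not the paper's BE/ARO formulation.
import numpy as np

def unbiased_confidence(sim_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    """For pair i, compare its own similarity sim[i, i] with the mean of its
    k most similar cross-modal neighbors (hypothetical contrast rule)."""
    n = sim_matrix.shape[0]
    own = np.diag(sim_matrix)
    conf = np.empty(n)
    for i in range(n):
        # Exclude the pair itself, then keep the k highest cross-modal similarities.
        others = np.delete(sim_matrix[i], i)
        neighbors = np.sort(others)[-k:]
        # Confidence grows when the pair's own similarity stands out from its neighbors.
        conf[i] = 1.0 / (1.0 + np.exp(-(own[i] - neighbors.mean())))
    return conf

def split_clean_noisy(conf: np.ndarray, threshold: float = 0.5):
    """Divide sample indices into clean / noisy sets using the confidence factor."""
    clean = np.where(conf >= threshold)[0]
    noisy = np.where(conf < threshold)[0]
    return clean, noisy

# Toy usage with a random similarity matrix standing in for model predictions.
rng = np.random.default_rng(0)
sim = rng.uniform(0.0, 1.0, size=(8, 8))
conf = unbiased_confidence(sim, k=3)
clean_idx, noisy_idx = split_clean_noisy(conf)
print(conf.round(3), clean_idx, noisy_idx)

In such a scheme, tailored losses (e.g., contrastive learning for the clean set and complementary learning for the noisy set, as the summary describes) would then be applied to each partition; those losses are not reproduced here.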
ISSN: 1551-6857, 1551-6865
DOI: 10.1145/3700596