Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval
Format: Article
Language: English
Online access: Order full text
Abstract: Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked autoencoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal autoencoders to enhance the model's understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized clues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.
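
The abstract names its components without implementation details. As illustration only: the contrastive step it describes (constraining matched 2D-3D pairs into a joint Hamming space) is commonly realized as a symmetric InfoNCE loss over tanh-relaxed hash codes plus a quantization penalty. The sketch below assumes a PyTorch setup; the code length, temperature, and penalty weight are hypothetical choices, not values from the paper.

```python
# Minimal sketch (not the authors' code) of cross-modal contrastive hashing:
# paired image / point-cloud embeddings are relaxed to (-1, 1) via tanh,
# matched pairs are pulled together with a symmetric InfoNCE loss, and
# codes are binarized with sign() at retrieval time.
import torch
import torch.nn.functional as F

def contrastive_hash_loss(img_feat, pcd_feat, temperature=0.07):
    """img_feat, pcd_feat: (B, K) encoder outputs for B matching 2D-3D pairs.

    K (hash code length) and temperature are illustrative assumptions.
    """
    # Relax binary codes to (-1, 1) so gradients can flow.
    img_code = torch.tanh(img_feat)
    pcd_code = torch.tanh(pcd_feat)

    # Cosine similarity between every image code and every point-cloud code.
    logits = F.normalize(img_code, dim=1) @ F.normalize(pcd_code, dim=1).T
    logits = logits / temperature

    # The i-th image matches the i-th point cloud; symmetric InfoNCE.
    targets = torch.arange(img_code.size(0), device=img_code.device)
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

    # Push relaxed codes toward {-1, +1} (quantization penalty, weight assumed).
    quant = ((img_code.abs() - 1) ** 2).mean() + ((pcd_code.abs() - 1) ** 2).mean()
    return loss + 0.1 * quant

# At retrieval time, binary hash bits come from the sign of the relaxed codes:
# hash_bits = torch.sign(torch.tanh(img_feat))  # (B, K) in {-1, +1}
```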
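Similarly, the masked image/point-cloud modeling task and the multi-modal fusion block are only named in the abstract. Below is a schematic sketch of how such pieces are typically wired; the cross-attention fusion design, token dimensions, and mask ratio are all assumptions rather than details from the paper.

```python
# Schematic sketch (assumed structure, not the paper's implementation):
# tokens from both modalities are randomly masked, and visible tokens
# interact through a cross-attention fusion block before per-modality
# decoders (omitted here) reconstruct the masked content.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Fine-grained interaction: each modality attends to the other."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img_to_pcd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pcd_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tok, pcd_tok):
        # Image tokens query point-cloud tokens, and vice versa;
        # residual connections keep the original modality content.
        img_out, _ = self.img_to_pcd(img_tok, pcd_tok, pcd_tok)
        pcd_out, _ = self.pcd_to_img(pcd_tok, img_tok, img_tok)
        return img_tok + img_out, pcd_tok + pcd_out

def random_mask(tokens, mask_ratio=0.6):
    """Keep a random subset of tokens; return kept tokens and their indices."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep
```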
DOI: 10.48550/arxiv.2408.05711