Unified multimodal fusion transformer for few shot object detection for remote sensing images

Bibliographic Details
Published in: Information Fusion, 2024-11, Vol. 111, p. 102508, Article 102508
Main Authors: Azeem, Abdullah; Li, Zhengzhou; Siddique, Abubakar; Zhang, Yuting; Zhou, Shangbo
Format: Article
Language: English
Online Access: Full text
Description
Abstract: Object detection is a fundamental computer vision task with wide applications in remote sensing, but traditional methods rely heavily on large annotated datasets, which are difficult to obtain, especially for novel object classes. Few-shot object detection (FSOD) addresses this by training detectors to learn from very limited labeled data. Recent work fuses multiple modalities, such as image–text pairs, to tackle data scarcity, but requires an external region proposal network (RPN) to align cross-modal pairs, which leads to a bias towards base classes and insufficient cross-modal contextual learning. To address these problems, we propose a unified multi-modal fusion transformer (UMFT), which extracts visual features with ViT and textual encodings with BERT and aligns the multi-modal representations in an end-to-end manner. Specifically, an affinity-guided fusion module (AFM) captures semantically related image–text pairs by modeling their affinity relationships and selectively combines the most informative pairs. In addition, a cross-modal correlation module (CCM) captures discriminative inter-modal patterns between image and text representations and filters out unrelated features to enhance cross-modal alignment. By leveraging AFM to focus on semantic relationships and CCM to refine inter-modal features, the model better aligns multi-modal data without an RPN. These representations are passed to a detection decoder that predicts bounding boxes, class probabilities, and cross-modal attributes. Evaluation of UMFT on the benchmark datasets NWPU VHR-10 and DIOR demonstrates its ability to leverage limited image–text training data via dynamic fusion, achieving new state-of-the-art mean average precision (mAP) for few-shot object detection. Our code will be publicly available at https://github.com/abdullah-azeem/umft.
Highlights:
• A novel unified multi-modal fusion transformer that leverages image–text pairs to improve few-shot object detection in remote sensing.
• An affinity-guided fusion module that selectively combines relevant image–text pairs based on learned affinity relations.
• A cross-modal correlation module that enhances discriminative inter-modal patterns based on similarity scores.
• New state-of-the-art mAP across various few-shot settings.
ISSN:1566-2535
1872-6305
DOI:10.1016/j.inffus.2024.102508
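
To make the fusion pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of the general idea: visual tokens (e.g., from a ViT) and textual tokens (e.g., from BERT) are combined by an affinity-weighted fusion step, gated by a cross-modal correlation score, and then fed to a transformer decoder that predicts boxes and class scores. All class names, dimensions, and the exact fusion/gating formulations here are illustrative assumptions, not the authors' implementation; see the repository linked in the abstract for the actual code.

```python
# Hedged sketch of an affinity-guided fusion + correlation-gating pipeline.
# Module names and formulations are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffinityGuidedFusion(nn.Module):
    """Weights image-text token pairs by a learned affinity and fuses them (stand-in for AFM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects visual tokens
        self.k = nn.Linear(dim, dim)   # projects textual tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, vis, txt):
        # vis: (B, Nv, D) visual tokens; txt: (B, Nt, D) textual tokens
        affinity = torch.softmax(
            self.q(vis) @ self.k(txt).transpose(1, 2) / vis.size(-1) ** 0.5, dim=-1
        )                                # (B, Nv, Nt) image-text affinity weights
        fused = affinity @ txt           # aggregate the most related text tokens per visual token
        return self.out(vis + fused)     # residual fusion of the two modalities


class CrossModalCorrelation(nn.Module):
    """Gates fused tokens by their correlation with a global text descriptor (stand-in for CCM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, fused, txt):
        txt_global = txt.mean(dim=1, keepdim=True)                     # (B, 1, D)
        corr = F.cosine_similarity(fused, txt_global, dim=-1).clamp(min=0)
        return fused * self.gate(fused) * corr.unsqueeze(-1)           # suppress weakly related tokens


class UMFTSketch(nn.Module):
    """Fusion followed by a transformer decoder predicting boxes and class logits."""

    def __init__(self, dim: int = 256, num_queries: int = 100, num_classes: int = 20):
        super().__init__()
        self.afm = AffinityGuidedFusion(dim)
        self.ccm = CrossModalCorrelation(dim)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.box_head = nn.Linear(dim, 4)            # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(dim, num_classes)  # class logits

    def forward(self, vis_tokens, txt_tokens):
        fused = self.ccm(self.afm(vis_tokens, txt_tokens), txt_tokens)
        queries = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, fused)
        return self.box_head(decoded).sigmoid(), self.cls_head(decoded)


if __name__ == "__main__":
    # Stand-ins for ViT patch tokens and BERT token embeddings, both projected to 256-d.
    vis = torch.randn(2, 196, 256)
    txt = torch.randn(2, 32, 256)
    boxes, logits = UMFTSketch()(vis, txt)
    print(boxes.shape, logits.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100, 20])
```

The query-based decoder is modeled loosely on DETR-style detection heads, which avoids an explicit RPN in line with the end-to-end alignment the abstract emphasizes; the real UMFT architecture may differ in depth, heads, and how cross-modal attributes are predicted.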