Few-Shot Object Detection in Aerial Imagery Guided by Text-Modal Knowledge

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2023-01, Vol. 61, p. 1-1
Main Authors: Lu, Xiaonan; Sun, Xian; Diao, Wenhui; Mao, Yongqiang; Li, Junxi; Zhang, Yidan; Wang, Peijin; Fu, Kun
Format: Article
Language: English
Description
Abstract: Few-shot object detection (FSOD) has received considerable attention because labeling objects is difficult and time-consuming. Recent studies achieve excellent performance in natural scenes by using only a few instances of novel classes to fine-tune the last prediction layer of a model well-trained on plentiful base data. However, compared with natural-scene objects, which have a single orientation and little size variation, objects in remote sensing images (RSIs) vary greatly in orientation and size, so methods designed for natural scenes cannot be applied directly to RSIs. In this paper, we first propose a strong baseline for RSIs: it fine-tunes all detector components acting on high-level features and effectively improves performance on novel classes. Analyzing the baseline's results further, we find that the error for novel classes is concentrated mainly in classification: novel classes are misclassified as confusable base classes or as background, because generalized information is hard to extract from limited instances. Text-modal knowledge, by contrast, can concisely summarize the generalized and unique characteristics of categories. We therefore introduce a text-modal description for each category and propose an FSOD method guided by TExt-MOdal knowledge, called TEMO. Specifically, a text-modal knowledge extractor and a cross-modal assembly module extract text features and fuse them into the visual-modal features; the fused features greatly reduce classification confusion for novel classes. Furthermore, we introduce a mask strategy and a separation loss to avoid over-fitting and ambiguity of the text-modal features. Experimental results on DIOR, NWPU, and FAIR1M show that TEMO achieves state-of-the-art performance in all settings.
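
The abstract describes the method only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible reading of the cross-modal assembly step (cross-attention from pooled region features to per-category text embeddings, with a residual connection) and a pairwise separation penalty on the text embeddings. All module names, dimensions, and the exact loss form here are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAssembly(nn.Module):
    """Hypothetical cross-modal assembly: pooled RoI features attend over
    per-category text embeddings, and the attended text context is added
    back to the visual features before classification."""
    def __init__(self, vis_dim=1024, txt_dim=512, hid_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(vis_dim, hid_dim)  # queries from visual features
        self.k_proj = nn.Linear(txt_dim, hid_dim)  # keys from text features
        self.v_proj = nn.Linear(txt_dim, hid_dim)  # values from text features
        self.out_proj = nn.Linear(hid_dim, vis_dim)

    def forward(self, roi_feats, text_embeds):
        # roi_feats: (N, vis_dim) pooled region features from the detector
        # text_embeds: (C, txt_dim) one embedding per category description
        q = self.q_proj(roi_feats)                                # (N, hid_dim)
        k = self.k_proj(text_embeds)                              # (C, hid_dim)
        v = self.v_proj(text_embeds)                              # (C, hid_dim)
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (N, C)
        return roi_feats + self.out_proj(attn @ v)                # residual fusion

def separation_loss(text_embeds):
    """One plausible 'separation loss': penalize pairwise cosine similarity
    between different categories' text embeddings to keep them unambiguous."""
    t = F.normalize(text_embeds, dim=-1)
    sim = t @ t.t()                                               # (C, C) cosine similarity
    off_diag = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    return sim[off_diag].clamp(min=0).mean()

# Toy usage: 8 candidate regions, 20 category descriptions (e.g. DIOR classes).
fuser = CrossModalAssembly()
text_embeds = torch.randn(20, 512)
fused = fuser(torch.randn(8, 1024), text_embeds)  # (8, 1024) text-aware features
loss = separation_loss(text_embeds)

The residual connection keeps the baseline's visual pathway intact, so in this reading the text features act as a correction to the visual features rather than a replacement, which is consistent with the abstract's claim that fusion reduces classification confusion without changing the detector's other components.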
ISSN: 0196-2892
1558-0644
DOI: 10.1109/TGRS.2023.3250448