C4Net: Excavating Cross-modal Context- and Content-Complementarity for RGB-T Semantic Segmentation
Published in: | IEEE transactions on circuits and systems for video technology 2024-10, p.1-1 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | The complementary properties exhibited by RGB-T data involve both context complementarity and content complementarity. During cross-modal feature fusion, most existing RGB-T semantic segmentation methods focus on exploiting content-complementary information. Unfortunately, these methods usually overlook cross-modal context-complementary information (i.e., the contextual dependencies among different regions that exist in only one modality) or try to exploit it only implicitly, yielding fragmentary semantic segmentation results. To remedy this problem, this paper presents a novel Cross-modal Context- and Content-Complementarity Network (C4Net) for RGB-T semantic segmentation, in which both the cross-modal context-complementary information and the cross-modal content-complementary information are fully excavated and exploited during cross-modal feature fusion. Specifically, a Context-Complementary Information Aggregation (CxCIA) module is designed, in which the cross-modal context-complementary information is explicitly excavated by measuring the discrepancies between the contextual dependencies of the two modalities. This information is then used to enhance the original RGB and thermal contextual dependencies, improving the integrity of objects in the fused features. Meanwhile, a Content-Complementary Information Aggregation (CnCIA) module is presented, which highlights the use of cross-modal content-complementary information from a multi-scale perspective. Furthermore, an MLP-based Multi-level Feature Interaction (MFI) decoder is presented, in which the semantic gaps among different levels of fused features are mitigated by establishing interactions among multi-level fused features along the spatial and channel dimensions. Comprehensive experimental results on several public datasets demonstrate that the proposed C4Net surpasses other state-of-the-art models. |
---|---|
ISSN: | 1051-8215 |
DOI: | 10.1109/TCSVT.2024.3485655 |
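
The abstract describes excavating context-complementary information by measuring the discrepancies between the contextual dependencies of the RGB and thermal modalities, then using that information to enhance each modality's dependencies. The following is only a minimal sketch of that idea, not the CxCIA module itself. It assumes, purely for illustration, that contextual dependencies are a row-softmax affinity matrix over region features, that the discrepancy is an element-wise absolute difference, and that enhancement is addition followed by row renormalization. The names `affinity`, `context_discrepancy`, and `enhance` are hypothetical and do not come from the paper.

```python
import math

def affinity(features):
    """Assumed form of 'contextual dependencies': row-softmax of pairwise dot products."""
    rows = []
    for f_i in features:
        scores = [sum(a * b for a, b in zip(f_i, f_j)) for f_j in features]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def context_discrepancy(aff_a, aff_b):
    """Assumed discrepancy measure: element-wise absolute difference of two affinity matrices."""
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(aff_a, aff_b)]

def enhance(aff, disc):
    """Assumed enhancement rule: add the discrepancy, then renormalize each row to sum to 1."""
    out = []
    for ra, rd in zip(aff, disc):
        row = [x + d for x, d in zip(ra, rd)]
        s = sum(row)
        out.append([v / s for v in row])
    return out

# Toy inputs (made up): 3 regions x 2 channels per modality.
rgb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
thermal = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

aff_rgb = affinity(rgb)
aff_t = affinity(thermal)
disc = context_discrepancy(aff_rgb, aff_t)  # large where only one modality links two regions
aff_rgb_enh = enhance(aff_rgb, disc)
aff_t_enh = enhance(aff_t, disc)
```

In this toy setup the second region resembles the first in RGB but the third in thermal, so the discrepancy matrix is largest in the second row, which is where the two modalities disagree about which regions belong together.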