SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer

This study proposes a novel general image fusion framework based on cross-domain long-range learning and Swin Transformer, termed as SwinFusion. On the one hand, an attention-guided cross-domain module is devised to achieve sufficient integration of complementary information and global interaction....

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE/CAA journal of automatica sinica 2022-07, Vol.9 (7), p.1200-1217
Hauptverfasser:	Ma, Jiayi, Tang, Linfeng, Fan, Fan, Huang, Jun, Mei, Xiaoguang, Ma, Yong
Format:	Artikel
Sprache:	eng
Schlagworte:	Algorithms Computer vision Cross-domain long-range learning Digital imaging image fusion Image processing Image segmentation Information retrieval Learning Maintenance engineering Network architecture Object detection Semantics Swin transformer Texture Transformers
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	This study proposes a novel general image fusion framework based on cross-domain long-range learning and Swin Transformer, termed as SwinFusion. On the one hand, an attention-guided cross-domain module is devised to achieve sufficient integration of complementary information and global interaction. More specifically, the proposed method involves an intra-domain fusion unit based on self-attention and an inter-domain fusion unit based on cross-attention, which mine and integrate long dependencies within the same domain and across domains. Through long-range dependency modeling, the network is able to fully implement domain-specific information extraction and cross-domain complementary information integration as well as maintaining the appropriate apparent intensity from a global perspective. In particular, we introduce the shifted windows mechanism into the self-attention and cross-attention, which allows our model to receive images with arbitrary sizes. On the other hand, the multi-scene image fusion problems are generalized to a unified framework with structure maintenance, detail preservation, and proper intensity control. Moreover, an elaborate loss function, consisting of SSIM loss, texture loss, and intensity loss, drives the network to preserve abundant texture details and structural information, as well as presenting optimal apparent intensity. Extensive experiments on both multi-modal image fusion and digital photography image fusion demonstrate the superiority of our SwinFusion compared to the state-of-the-art unified image fusion algorithms and task-specific alternatives. Implementation code and pre-trained weights can be accessed at https://github.com/Linfeng-Tang/SwinFusion.
ISSN:	2329-9266 2329-9274
DOI:	10.1109/JAS.2022.105686