Multimodal Image Fusion via Self-Supervised Transformer
Published in: | IEEE Sensors Journal 2023-05, Vol.23 (9), p.9796-9807 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Summary: | Multimodal image fusion (MMIF) aims to synthesize a single, more informative image from complementary multimodal images. Because no ground truth is available, most MMIF works use unsupervised loss functions to preserve specific source information in the fused image, and therefore suffer from subjective information preservation and a high demand for large training sets. Self-supervised learning is a feasible way to avoid these issues. Current self-supervised MMIF works pretrain the network on a large natural image set via the vanilla image reconstruction task, then implement image fusion by integrating fusion rules into the well-trained network. However, they ignore the domain discrepancy between the training data (i.e., the natural image set) and the test data (i.e., the MMIF set). In addition, existing MMIF methods are mostly built on traditional convolutional neural networks, which cannot model long-range contextual relationships of the source modalities. To this end, we propose a self-supervised transformer for MMIF. Under the self-supervised framework, we design random image super-resolution (RSR) as the pretext task to train the network. RSR and MMIF have similar multiresolution inputs and share common goals, namely image restoration and enhancement, which helps diminish the domain discrepancy between the training and test data. Moreover, we present a transformer-based fusion network that captures both local and global information of the source modalities. The proposed method is validated on multiple MMIF datasets. Qualitative and quantitative experimental results demonstrate that the proposed method yields superior fusion quality compared with state-of-the-art image fusion methods. |
---|---|
ISSN: | 1530-437X 1558-1748 |
DOI: | 10.1109/JSEN.2023.3263336 |