Multimedia analysis of robustly optimized multimodal transformer based on vision and language co-learning
Saved in:
Published in: Information Fusion, 2023-12, Vol. 100, p. 101922, Article 101922
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Recently, research on multimodal learning that uses information from all modalities has been conducted to detect disinformation in multimedia. Existing multimodal learning methods include score-level fusion approaches, which combine the outputs of different models, and feature-level fusion methods, which combine embedding vectors to integrate data of different dimensions. Because a late-level fusion method combines the modalities only after each has been processed individually, the recognition performance of the unimodal models limits the overall performance. Feature-level fusion, in turn, is constrained by the requirement that the data be matched across modalities. In this study, we propose a classification system using a RoBERTa-based multimodal fusion transformer (RoBERTaMFT) that applies a co-learning method to overcome both the recognition-performance limitations of multimodal learning and the data imbalance among the modalities. RoBERTaMFT consists of image feature extraction, co-learning that reconstructs image features from text embeddings, and a late-level fusion step applied to the final classification. In experiments on the CrisisMMD dataset, RoBERTaMFT achieved an accuracy 21.2% higher and an F1-score 0.414 higher than unimodal learning, and an accuracy 11.7% higher and an F1-score 0.268 higher than existing multimodal learning methods.
Highlights:
• Multimodal learning uses two or more modalities, such as text, images, or audio.
• Multimodal learning supplements missing information in one modality with information from the others.
• Multimodal learning is limited by unimodal performance and requires matched data across modalities.
• Multimodal co-learning can resolve these performance and data-imbalance limitations.
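The abstract names three stages: image feature extraction, co-learning that reconstructs image features from text embeddings, and late-level fusion for the final classification. The PyTorch-style sketch below is only an illustrative reconstruction of that pipeline; the backbone choice (ResNet-50), CLS-token pooling, the MSE reconstruction loss, score averaging, and all module names are assumptions for illustration, not the authors' published code.

```python
# Illustrative sketch of the RoBERTaMFT pipeline described in the abstract.
# All architectural details here are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
from transformers import RobertaModel
from torchvision.models import resnet50

class RoBERTaMFTSketch(nn.Module):
    def __init__(self, num_classes: int, text_hidden: int = 768, img_dim: int = 2048):
        super().__init__()
        # Stage 1: unimodal encoders (image feature extraction + text embedding).
        self.image_encoder = resnet50(weights="IMAGENET1K_V2")
        self.image_encoder.fc = nn.Identity()          # expose 2048-d image features
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Stage 2: co-learning head that reconstructs image features from text.
        self.reconstruct = nn.Linear(text_hidden, img_dim)
        # Stage 3: per-modality classifiers whose scores are fused late.
        self.img_cls = nn.Linear(img_dim, num_classes)
        self.txt_cls = nn.Linear(text_hidden, num_classes)

    def forward(self, pixel_values, input_ids, attention_mask):
        img_feat = self.image_encoder(pixel_values)                  # (B, 2048)
        txt_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                    # CLS token, (B, 768)
        # Co-learning: the text embedding learns to reconstruct the image
        # features, coupling the modalities even when pairs are imbalanced.
        recon = self.reconstruct(txt_feat)
        recon_loss = nn.functional.mse_loss(recon, img_feat.detach())
        # Late-level fusion: average the unimodal classification scores.
        logits = (self.img_cls(img_feat) + self.txt_cls(txt_feat)) / 2
        return logits, recon_loss
```

In training, the total objective would plausibly combine cross-entropy on the fused logits with the reconstruction term, e.g. `loss = nn.functional.cross_entropy(logits, labels) + lam * recon_loss`, where the weighting `lam` is a hypothetical hyperparameter.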
ISSN: 1566-2535, 1872-6305
DOI: 10.1016/j.inffus.2023.101922