Multimodal Boosting: Addressing Noisy Modalities and Identifying Modality Contribution

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024-01, Vol. 26, pp. 3018-3033
Authors: Mai, Sijie; Sun, Ya; Xiong, Aolin; Zeng, Ying; Hu, Haifeng
Format: Article
Language: English
Abstract: In multimodal representation learning, different modalities do not contribute equally. In particular, when learning with noisy modalities that convey non-discriminative information, the prediction based on the multimodal representation is often biased and may even ignore knowledge from informative modalities. In this paper, we aim to address the noisy-modality problem and balance the contributions of multiple modalities dynamically in a parallel format. Specifically, we construct multiple base learners and formulate our framework as a boosting-like algorithm, where different base learners focus on different aspects of multimodal learning. To identify the contributions of individual base learners, we develop a contribution learning network that dynamically determines the contribution and noise level of each base learner. In contrast to the commonly used attention mechanism, we define a transformation of the predictive loss as the supervision signal for training the contribution learning network, which enables more accurate learning of modality importance. We derive the final prediction by combining the predictions of the base learners according to their contributions. Notably, unlike late fusion, we devise a multimodal base learner to explore cross-modal interactions. To update the network, we design a 'complementary update mechanism': for each base learner, we assign higher weights to samples that are incorrectly predicted by the other base learners. In this way, the available information is leveraged to predict each sample correctly to the greatest extent possible, and different base learners learn different aspects of the multimodal information. Extensive experiments demonstrate that the proposed method achieves superior performance on multimodal sentiment analysis and emotion recognition.
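The abstract describes a boosting-like fusion scheme: several base learners (unimodal plus one multimodal) produce predictions, a contribution learning network weights them per sample, and each learner's training loss up-weights samples that the other learners misclassify. The following is a minimal, hypothetical PyTorch sketch of that idea. The module names (BaseLearner, ContributionNet, complementary_weights), layer sizes, and the exact weighting formula are illustrative assumptions, not the authors' implementation, and the loss-based supervision of the contribution network mentioned in the abstract is omitted for brevity.

```python
# Illustrative sketch only; hyperparameters and architectures are assumptions.
import torch
import torch.nn as nn

class BaseLearner(nn.Module):
    """A small classifier standing in for one base learner."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))
    def forward(self, x):
        return self.net(x)

class ContributionNet(nn.Module):
    """Scores each base learner's logits and normalizes across learners."""
    def __init__(self, num_classes):
        super().__init__()
        self.score = nn.Linear(num_classes, 1)
    def forward(self, logits_list):
        scores = torch.cat([self.score(l) for l in logits_list], dim=1)  # (B, K)
        return torch.softmax(scores, dim=1)  # per-sample contributions, sum to 1

def fuse(logits_list, contributions):
    # Contribution-weighted combination of base-learner logits.
    stacked = torch.stack(logits_list, dim=1)                # (B, K, C)
    return (contributions.unsqueeze(-1) * stacked).sum(dim=1)  # (B, C)

def complementary_weights(logits_list, labels, k):
    # For learner k, up-weight samples misclassified by the OTHER learners
    # (an assumed form of the "complementary update mechanism").
    with torch.no_grad():
        wrong = torch.zeros(labels.size(0))
        for j, logits in enumerate(logits_list):
            if j != k:
                wrong += (logits.argmax(dim=1) != labels).float()
        return 1.0 + wrong / max(len(logits_list) - 1, 1)    # weights in [1, 2]

if __name__ == "__main__":
    # Toy usage: three unimodal learners plus one multimodal learner on random data.
    B, C = 8, 3
    dims = [16, 24, 32]                                  # assumed feature sizes per modality
    feats = [torch.randn(B, d) for d in dims]
    labels = torch.randint(0, C, (B,))

    learners = [BaseLearner(d, C) for d in dims]
    learners.append(BaseLearner(sum(dims), C))           # multimodal base learner
    inputs = feats + [torch.cat(feats, dim=1)]

    logits_list = [m(x) for m, x in zip(learners, inputs)]
    contributions = ContributionNet(C)(logits_list)
    final_logits = fuse(logits_list, contributions)

    # Per-learner losses weighted by the complementary sample weights.
    ce = nn.CrossEntropyLoss(reduction="none")
    losses = [(complementary_weights(logits_list, labels, k) * ce(l, labels)).mean()
              for k, l in enumerate(logits_list)]
    total = sum(losses) + ce(final_logits, labels).mean()
    print(final_logits.shape, float(total))
```

In this sketch the contributions are produced by a softmax over learner scores, so noisy learners can be down-weighted per sample; the paper instead trains the contribution network with a loss-derived supervision signal rather than a plain attention-style softmax.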
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3306489