Multimodal Boosting: Addressing Noisy Modalities and Identifying Modality Contribution
Published in: IEEE Transactions on Multimedia, 2024-01, Vol. 26, pp. 3018-3033
Main authors: , , , ,
Format: Article
Language: English
Abstract: In multimodal representation learning, different modalities do not contribute equally. In particular, when learning with noisy modalities that convey non-discriminative information, predictions based on the multimodal representation are often biased and may even ignore the knowledge from informative modalities. In this paper, we aim to address the noisy-modality problem and balance the contributions of multiple modalities dynamically in a parallel format. Specifically, we construct multiple base learners and formulate our framework as a boosting-like algorithm, where different base learners focus on different aspects of multimodal learning. To identify the contributions of individual base learners, we develop a contribution learning network that dynamically determines the contribution and noise level of each base learner. In contrast to the commonly used attention mechanism, we define a transformation of the predictive loss as the supervision signal for training the contribution learning network, which enables more accurate learning of modality importance. We derive the final prediction by combining the predictions of the base learners according to their contributions. Notably, unlike late fusion, we devise a multimodal base learner to explore cross-modal interactions. To update the network, we design a "complementary update mechanism": for each base learner, we assign higher weights to samples that are incorrectly predicted by the other base learners. In this way, we leverage the available information to predict each sample correctly to the utmost extent and enable different base learners to learn different aspects of the multimodal information. Extensive experiments demonstrate that the proposed method achieves superior performance on multimodal sentiment analysis and emotion recognition.
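The abstract describes two core ingredients: contribution-weighted fusion of base-learner predictions, and a complementary update that upweights samples other learners got wrong. The sketch below illustrates both mechanics in a minimal NumPy form; it is not the authors' implementation, and the additive reweighting rule and normalization choices are illustrative assumptions.

```python
import numpy as np

def complementary_weights(correct, k):
    """Sample weights for base learner k under a complementary-update-style rule.

    correct: (num_learners, num_samples) boolean matrix; correct[j, i] is True
             if base learner j predicted sample i correctly.
    Samples misclassified by the OTHER learners receive higher weight, so
    learner k focuses on what its peers miss (additive rule is an assumption).
    """
    others = np.delete(correct, k, axis=0)          # drop learner k's own row
    errors_by_others = (~others).sum(axis=0)        # how many peers got each sample wrong
    w = 1.0 + errors_by_others                      # base weight 1, +1 per peer error
    return w / w.sum()                              # normalize to a distribution

def fused_prediction(preds, contributions):
    """Contribution-weighted combination of base-learner predictions.

    preds: (num_learners, num_samples, num_classes) class probabilities.
    contributions: (num_learners,) non-negative scores, e.g. from a
                   contribution learning network (here just given as input).
    """
    c = contributions / contributions.sum()         # normalize contributions
    return np.tensordot(c, preds, axes=1)           # weighted sum over learners
```

For example, with three learners where the peers of learner 0 both err on samples 0 and 2, those samples get double the weight of sample 1; and a learner with contribution 3 dominates one with contribution 1 in the fused output by a 3:1 ratio.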
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3306489