VATMAN: Integrating Video-Audio-Text for Multimodal Abstractive SummarizatioN via Crossmodal Multi-Head Attention Fusion

Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, pp. 119174-119184
Main Authors: Baek, Doosan; Kim, Jiho; Lee, Hongchul
Format: Article
Language: English
Online Access: Full text
Description
Abstract: The paper introduces VATMAN (Video-Audio-Text Multimodal Abstractive summarizatioN), a novel approach for generating hierarchical multimodal summaries using Trimodal Hierarchical Multi-head Attention. Unlike existing generative pre-trained language models, VATMAN employs an attention mechanism that attends hierarchically to the visual, audio, and text modalities. However, the existing literature lacks cross-modal attention at the block level. In light of this, we propose a block-level cross-modal attention mechanism, termed Blockwise Cross-modal Multi-head Attention (BCMA), to enhance summarization performance. This mechanism enables the model to capture context information from the visual, audio, and text modalities simultaneously, providing a more comprehensive understanding of the input data. In terms of performance, VATMAN outperforms the state-of-the-art RNN-based trimodal model on the How2 dataset, achieving a ROUGE-1 improvement of 7.53% and a ROUGE-L improvement of 2.19% and demonstrating superior summarization quality. In addition, compared with unimodal and bimodal baseline transformer models, VATMAN improves ROUGE-L scores by 11.12% and 3.85%, respectively, highlighting its effectiveness in capturing hierarchical relationships across modalities. Furthermore, we evaluated the generated abstractive summaries with several metrics, including BLEU, METEOR, CIDEr, Content F1, and BERTScore; the proposed model consistently outperformed the others across most metrics, demonstrating its effectiveness in these qualitative assessments.
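The abstract describes fusing text, video, and audio streams with block-level cross-modal multi-head attention. The PyTorch sketch below illustrates the general idea only: a single fusion block in which text features query video and audio features through separate multi-head attention layers. The class name, dimensions, layer ordering, and the demo tensors are illustrative assumptions and not the authors' BCMA implementation.

import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Illustrative trimodal block: text self-attention, then cross-attention to video and audio."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, text, video, audio):
        # Self-attention over the text (query) stream.
        x = self.norms[0](text + self.self_attn(text, text, text)[0])
        # Cross-modal attention: text tokens attend to video features ...
        x = self.norms[1](x + self.video_attn(x, video, video)[0])
        # ... and to audio features, so all three modalities are fused within the block.
        x = self.norms[2](x + self.audio_attn(x, audio, audio)[0])
        return self.norms[3](x + self.ffn(x))


if __name__ == "__main__":
    block = CrossModalFusionBlock()
    text = torch.randn(2, 30, 512)   # (batch, text tokens, d_model)
    video = torch.randn(2, 64, 512)  # (batch, video frames, d_model)
    audio = torch.randn(2, 50, 512)  # (batch, audio segments, d_model)
    print(block(text, video, audio).shape)  # torch.Size([2, 30, 512])

In a hypothetical encoder built from such blocks, the cross-attention layers let every block see visual and acoustic context rather than fusing the modalities only once at the end, which is the intuition behind block-level cross-modal attention described in the abstract.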
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3447737