An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning


Detailed Description

Bibliographic Details
Published in: EURASIP Journal on Image and Video Processing, 2025-01, Vol. 2025 (1), p. 1-27, Article 1
Main authors: Shankar, M. Gowri; Surendran, D.
Format: Article
Language: English
Online Access: Full Text
Description
Summary: Video captioning presents a complex challenge, particularly due to the increased subject intensity within videos compared to image caption generation. The presence of redundant visual information in video data adds complexity for captioners, making it difficult to distill varied content and eliminate irrelevant elements. Additionally, this redundancy often results in misalignment with equivalent visual semantics in the ground truth, further complicating the video captioning process. In response to these challenges, this research introduces the Graylag Deep Kookaburra Reinforcement Learning (GDKRL) framework, which enhances video captioning through a multi-stage process. First, object detection is performed using the single-shot multi-box detector with generalized intersection over union for accurate object tracking and similarity calculation. Next, the gazelle autoencoder extracts and fuses features from video frames, integrating complex visual and temporal information into a unified representation. The residual convolved dual sparse graph attention network then generates detailed and contextually rich language descriptions by applying dual sparse attention mechanisms and residual convolutions. Finally, hybrid graylag and kookaburra optimization refines the captioning process, producing comprehensive and precise textual descriptions of the video content. Extensive experiments on MSVD yielded 81.79 BLEU-4, 51.2 METEOR, 133.3 CIDEr and 81.7 ROUGE-L; VATEX achieved 62.29 BLEU-4, 51.2 METEOR, 110.2 CIDEr and 78.45 ROUGE-L; and the MSR-VTT dataset obtained 44.52 BLEU-4, 33.35 METEOR, 63.9 CIDEr and 68.9 ROUGE-L, demonstrating that the proposed technique significantly outperforms previous approaches and highlights its effectiveness.
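The abstract's detection stage relies on generalized intersection over union (GIoU), which extends plain IoU with a penalty for the empty space inside the smallest box enclosing both candidates, so that even non-overlapping boxes receive a graded similarity score. The paper's own detector is not reproduced here; the following is only a minimal standalone sketch of the standard GIoU formula for two axis-aligned boxes in `(x1, y1, x2, y2)` form (function name and box format are illustrative assumptions, not from the article):

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in (-1, 1]: 1.0 for identical boxes, and a
    negative score for disjoint boxes that shrinks as they move apart.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest enclosing box C covering both inputs.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if area_c <= 0:
        return iou

    # GIoU = IoU minus the fraction of C not covered by the union.
    return iou - (area_c - union) / area_c
```

For example, two identical boxes score 1.0, while well-separated boxes score below zero, which is what makes GIoU usable as a tracking-similarity signal when detections drift apart between frames.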
ISSN: 1687-5281
1687-5176
DOI: 10.1186/s13640-024-00662-z