Research on Video Captioning Based on Multifeature Fusion



Bibliographic Details
Published in: Computational intelligence and neuroscience 2022-04, Vol. 2022, p. 1204909-14
Main authors: Zhao, Hong; Guo, Lan; Chen, ZhiWen; Zheng, HouZe
Format: Article
Language: English
Online access: Full text
Description
Abstract: To address the shortcomings of existing video captioning models, namely that they attend to incomplete information and generate insufficiently accurate descriptions, a video captioning model that fuses image, audio, and motion optical-flow features is proposed. Models pretrained on a variety of large-scale datasets are used to extract video-frame features, motion information, audio features, and video sequence features. An embedding-layer structure based on the self-attention mechanism is designed to embed the features of each single modality and learn their parameters. Two fusion schemes, joint representation and cooperative representation, are then applied to the feature vectors output by the embedding layer, so that the model can attend to different targets in the video and their interactions, which effectively improves captioning performance. Experiments are carried out on the large-scale MSR-VTT and LSMDC datasets. On the MSR-VTT benchmark, the model scores 0.443, 0.327, 0.619, and 0.521 under the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics, respectively. The results show that the proposed method effectively improves the performance of the video captioning model, with all evaluation metrics improved over the comparison models.
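The pipeline the abstract describes, per-modality self-attention embedding followed by joint-representation fusion, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, dimensions, and random projection matrices (stand-ins for learned weights) are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_embed(x, d_k, rng):
    """Single-head scaled dot-product self-attention over one modality.

    x: (seq_len, d_model) feature sequence for a single modality.
    The random projection matrices stand in for learned parameters.
    """
    d_model = x.shape[1]
    w_q = rng.standard_normal((d_model, d_k)) * 0.1
    w_k = rng.standard_normal((d_model, d_k)) * 0.1
    w_v = rng.standard_normal((d_model, d_k)) * 0.1
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
    return attn @ v                                   # (seq_len, d_k)

def joint_fusion(modalities, d_k=64, seed=0):
    """Joint representation: embed each modality with self-attention,
    mean-pool over time, and concatenate into one fused vector."""
    rng = np.random.default_rng(seed)
    pooled = [self_attention_embed(x, d_k, rng).mean(axis=0)
              for x in modalities]
    return np.concatenate(pooled)  # (n_modalities * d_k,)

# Toy features: 20 frame vectors, 20 optical-flow vectors, 30 audio vectors.
rng = np.random.default_rng(1)
frames = rng.standard_normal((20, 512))
flow = rng.standard_normal((20, 512))
audio = rng.standard_normal((30, 128))

fused = joint_fusion([frames, flow, audio], d_k=64)
print(fused.shape)  # (192,)
```

The cooperative-representation scheme mentioned in the abstract would instead keep the modalities in separate, coordinated spaces (e.g. aligned by a similarity constraint) rather than concatenating them; the paper itself should be consulted for the exact formulation.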
ISSN: 1687-5265, 1687-5273
DOI:10.1155/2022/1204909