Deep sequential collaborative cognition of vision and language based model for video description
Published in: Multimedia Tools and Applications, 2023-09, Vol. 82 (23), pp. 36207-36230
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Summary: Video description aims to translate video into natural language with appropriate sentence patterns and well-chosen words. The task is challenging due to the large semantic gap between visual content and language. Many well-designed models have been developed, but language information is often insufficiently exploited and poorly integrated with visual representations, so correlations between vision and language are difficult to construct. Inspired by the way humans learn and jointly process vision and language, a deep collaborative cognition of vision and language based model (VL-DCC) is proposed in this work. Specifically, an extra language encoding branch is designed and integrated with the visual motion encoding branch in a sequence-to-sequence pipeline during model learning, to simulate how humans acquire visual information and language together. Additionally, a double VL-DCC (DVL-DCC) framework is developed to further improve the quality of generated sentences: element-wise addition and feature concatenation are applied in two separate VL-DCC modules, respectively, to comprehensively capture visual and language semantics. Experiments on the MSVD and MSR-VTT2016 datasets show that the proposed model outperforms the baseline model and other popular works, with CIDEr scores of 81.3 and 46.7 on the two datasets, respectively. A minimal code sketch of the dual-branch fusion idea is given after the record below.
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-023-14887-z
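
The abstract above describes two fusion operations in the DVL-DCC framework: element-wise addition in one VL-DCC module and feature concatenation in the other. Since the record contains no code, the following is a minimal, hypothetical PyTorch sketch of that dual-branch fusion idea, not the authors' implementation; the class names (VLDCCBranch, DVLDCCFusion), encoder choices, and all dimensions are assumptions made for illustration.

```python
# Hypothetical sketch of the dual-branch vision/language fusion described
# in the abstract. Not the authors' code: module names, GRU encoders, and
# feature sizes are illustrative assumptions.
import torch
import torch.nn as nn


class VLDCCBranch(nn.Module):
    """Stand-in for one VL-DCC module: encodes a visual motion feature
    sequence and a language (word-embedding) sequence with GRUs."""

    def __init__(self, vis_dim=2048, lang_dim=300, hidden=512):
        super().__init__()
        self.vis_enc = nn.GRU(vis_dim, hidden, batch_first=True)
        self.lang_enc = nn.GRU(lang_dim, hidden, batch_first=True)

    def forward(self, vis_seq, lang_seq):
        _, h_vis = self.vis_enc(vis_seq)     # final visual state, (1, B, H)
        _, h_lang = self.lang_enc(lang_seq)  # final language state, (1, B, H)
        return h_vis.squeeze(0), h_lang.squeeze(0)


class DVLDCCFusion(nn.Module):
    """Double-branch fusion: one branch fused by element-wise addition,
    the other by concatenation, then projected to a common size."""

    def __init__(self, hidden=512):
        super().__init__()
        self.branch_add = VLDCCBranch(hidden=hidden)
        self.branch_cat = VLDCCBranch(hidden=hidden)
        # addition yields H features, concatenation yields 2H; combine both
        self.proj = nn.Linear(hidden + 2 * hidden, hidden)

    def forward(self, vis_seq, lang_seq):
        v_a, l_a = self.branch_add(vis_seq, lang_seq)
        v_c, l_c = self.branch_cat(vis_seq, lang_seq)
        fused_add = v_a + l_a                      # element-wise addition
        fused_cat = torch.cat([v_c, l_c], dim=-1)  # feature concatenation
        return self.proj(torch.cat([fused_add, fused_cat], dim=-1))


# Toy usage: batch of 2 clips, 20 frame features, 15 word embeddings.
fusion = DVLDCCFusion()
vis = torch.randn(2, 20, 2048)
lang = torch.randn(2, 15, 300)
print(fusion(vis, lang).shape)  # torch.Size([2, 512])
```

The two branches are kept separate so that each fusion operator sees features trained for it, mirroring the abstract's claim that addition and concatenation are applied in two different VL-DCC modules rather than on shared features.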