Energy Shuttle Graph Convolution for Multimodal Relation Recognition in Videos
Saved in:
Published in: | IEEE Access 2024, Vol.12, p.120077-120086 |
---|---|
Main Authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Full text |
Summary: | Reasoning about social relationships between characters from visual information can help determine characters' roles and understand their interaction patterns in different social contexts. Research has recently moved from static images to video, yet fully exploiting the rich information contained in videos remains an open problem. On the one hand, most existing work does not consider the multimodal features that videos carry across the visual, linguistic (text), audio, and other modalities. On the other hand, the temporal features in videos can provide a global horizon for reasoning about character relationships. We therefore propose ESRR, a reasoning framework based on multimodal graph convolutional networks, to identify character relationships in videos. In the spatial domain, we use an energization operation to extract multimodal features and model them as graphs. In the temporal domain, we use an LSTM stack and energy transfer between graphs to extract the temporal features of videos from both local and global perspectives. ESRR thus extracts comprehensive video features to analyze the spatial-temporal dependencies between characters and relationship labels and to reason about social relationships. Extensive experiments on the evaluated datasets demonstrate that ESRR achieves superior performance in video character relationship recognition, outperforming current mainstream models. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2024.3450312 |
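The abstract's spatial step, modeling per-character multimodal features as a graph and refining them with graph convolution, can be illustrated with a minimal sketch. This is not the authors' implementation: the adjacency structure, feature dimensions, and the single symmetrically normalized GCN layer below are all illustrative assumptions standing in for the paper's energization operation.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy scene: 3 characters; each node feature concatenates visual, text,
# and audio embeddings (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 8))
text = rng.normal(size=(3, 4))
audio = rng.normal(size=(3, 4))
H = np.concatenate([visual, text, audio], axis=1)       # (3, 16) multimodal nodes
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # co-occurrence edges
W = rng.normal(size=(16, 32))                           # learnable projection

H_out = gcn_layer(A, H, W)
print(H_out.shape)  # (3, 32): refined per-character relation features
```

In a full pipeline along the lines the abstract sketches, the per-frame graph outputs would then feed a stacked LSTM to capture local temporal dependencies, with cross-graph interaction supplying the global view.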