Visual-semantic Alignment Temporal Parsing for Action Quality Assessment
Saved in:
Published in: | IEEE Transactions on Circuits and Systems for Video Technology, 2024-10, p. 1-1 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Abstract: | Action Quality Assessment (AQA) is a challenging task that involves analyzing fine-grained technical subactions, aligning high-level visual-semantic representations, and exploring internal temporal structures that capture the overall meaning of given action sequences. To address these challenges, we propose a Visual-semantic Alignment Temporal Parsing Network (VATP-Net) to understand the high-level visual semantics of subaction sequences and internal temporal structures, without explicit supervision, for action quality assessment. The proposed approach designs a self-supervised temporal parsing module that generates subaction sequences from the given video by aligning the visual and semantic action features, capturing both the high-level semantics and the internal temporal dynamics of subaction sequences. Furthermore, a multimodal interaction module is proposed to capture the interaction between different modalities of action features, enabling a comprehensive assessment of fine-grained and scene-invariant action details. The proposed module captures the intricate relationships and encourages interactions between different modalities within an action sequence, enhancing the overall understanding of action assessment. We exhaustively evaluate our proposed approach on the MTL-AQA, Rhythmic Gymnastics (RG), FineFS, and Fis-V datasets. Extensive experimental results demonstrate the effectiveness and feasibility of our proposed approach, which outperforms state-of-the-art methods by a significant margin. |
---|---|
ISSN: | 1051-8215 1558-2205 |
DOI: | 10.1109/TCSVT.2024.3487242 |