A Two-Stream Hybrid CNN-Transformer Network for Skeleton-Based Human Interaction Recognition

Bibliographic details
Main authors: Yin, Ruoqi; Yin, Jianqin
Format: Book chapter
Language: English
Online access: Full text
Description
Summary: Human Interaction Recognition (HIR) is the process of identifying and understanding interactive actions and activities between multiple participants in a specific environment or situation. Approaches based on a single Convolutional Neural Network (CNN) have limitations, such as an inability to capture global instance-interaction features or difficulty in training, which leads to ambiguity in action semantics. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of the CNN and models global dependencies through the Transformer. The CNN and Transformer streams jointly model the entity, temporal, and spatial relationships between interacting entities. Multi-grained information modelling is employed to improve the accuracy and robustness of the action recognition system. Experimental results on diverse and challenging datasets, such as NTU-RGBD, H2O, and Assembly101, demonstrate that the proposed method better comprehends and infers the meaning and context of various actions, outperforming state-of-the-art methods.
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-981-97-8511-7_28
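
Illustrative note: the summary contrasts a CNN stream that captures local patterns with a Transformer stream that captures global dependencies. The toy sketch below shows that contrast on a one-dimensional sequence, for example a per-frame distance between two skeleton joints. It is not the THCT-Net implementation. The function names, the scalar features, the smoothing kernel, the identity attention projections, and the fusion by averaging are all assumptions made for illustration, since the record does not specify any of them.

```python
import math


def local_conv1d(seq, kernel):
    """Local stream (CNN-like): zero-padded 1D convolution along time.

    Each output frame depends only on a small window of neighbouring frames.
    """
    pad = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - pad
            if 0 <= idx < len(seq):  # frames outside the sequence count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_self_attention(seq):
    """Global stream (Transformer-like): single-head self-attention.

    Assumption: queries, keys and values are the raw scalar features
    (identity projections), so every output frame attends to all frames.
    """
    out = []
    for q in seq:
        weights = softmax([q * k for k in seq])
        out.append(sum(w * v for w, v in zip(weights, seq)))
    return out


def two_stream_fuse(seq, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical fusion: average the local and global stream outputs."""
    local = local_conv1d(seq, kernel)
    glob = global_self_attention(seq)
    return [(a + b) / 2.0 for a, b in zip(local, glob)]


if __name__ == "__main__":
    # Made-up per-frame feature values, used only to exercise the functions.
    frames = [0.9, 0.7, 0.4, 0.2, 0.3, 0.8]
    print(two_stream_fuse(frames))
```

In this sketch the convolution output at a frame changes only when nearby frames change, whereas the attention output at a frame can change when any frame in the sequence changes, which is the local versus global distinction the summary draws.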