A Two-Stream Hybrid CNN-Transformer Network for Skeleton-Based Human Interaction Recognition
Human Interaction Recognition (HIR) is the process of identifying and understanding interactive actions and activities between multiple participants in a specific environment or situation. Many single Convolutional Neural Networks (CNN) has issues, such as the inability to capture global instance in...
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Buchkapitel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Human Interaction Recognition (HIR) is the process of identifying and understanding interactive actions and activities between multiple participants in a specific environment or situation. Many single Convolutional Neural Networks (CNN) has issues, such as the inability to capture global instance interaction features or difficulty in training, leading to ambiguity in action semantics. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNN and models global dependencies through the Transformer. CNN and Transformer simultaneously model the entity, time and space relationships between interactive entities respectively. Multi-grained information modelling is employed to enhance the accuracy and robustness of the action recognition system. Experimental results on diverse and challenging datasets, such as NTU-RGBD, H2O, and Assembly101, demonstrate that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods. |
---|---|
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-981-97-8511-7_28 |