Multi-modality learning for human action recognition

Bibliographic Details
Published in: Multimedia Tools and Applications, 2021-05, Vol. 80 (11), pp. 16185-16203
Authors: Ren, Ziliang; Zhang, Qieshi; Gao, Xiangyang; Hao, Pengyi; Cheng, Jun
Format: Article
Language: English
Online Access: Full Text
Description
Abstract: Multi-modality based human action recognition is a growing research topic, since multi-modal data provide richer and more complementary information than a single modality. However, it is difficult for multi-modality learning to effectively capture spatial-temporal information from entire RGB and depth sequences. In this paper, to obtain a better representation of spatial-temporal information, we propose a bidirectional rank pooling method to construct RGB Visual Dynamic Images (VDIs) and Depth Dynamic Images (DDIs). Furthermore, we design an effective segmentation convolutional network (ConvNet) architecture based on a multi-modality hierarchical fusion strategy for human action recognition. The proposed method achieves state-of-the-art results on the widely used NTU RGB+D, SYSU 3D HOI and UWA3D II datasets.
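
The record does not spell out how the dynamic images are built, but the general recipe for rank-pooled dynamic images is well documented. Below is a minimal NumPy sketch using the linear approximate-rank-pooling coefficients of Bilen et al. (CVPR 2016); the function names, frame shapes, and rescaling step are illustrative assumptions, and the authors' bidirectional rank pooling may differ in detail (e.g., exact SVR-based ranking or pooling over sequence segments rather than whole clips).

    import numpy as np

    def approx_rank_pool(frames):
        # frames: sequence of T frames, each H x W x C.
        # Approximate rank pooling (Bilen et al., CVPR 2016): a single
        # weighted sum of the frames with linear coefficients
        # alpha_t = 2t - T - 1, t = 1..T, so early frames get negative
        # weight and late frames positive weight, encoding temporal order.
        x = np.asarray(frames, dtype=np.float64)
        T = x.shape[0]
        alphas = 2.0 * np.arange(1, T + 1) - T - 1
        di = np.tensordot(alphas, x, axes=1)  # -> H x W x C "dynamic image"
        # Rescale to [0, 255] so the result can be fed to an image ConvNet.
        di = 255.0 * (di - di.min()) / max(np.ptp(di), 1e-8)
        return di.astype(np.uint8)

    def bidirectional_dynamic_images(frames):
        # Pool the sequence and its temporal reverse, yielding a forward
        # and a backward dynamic image per clip.
        return approx_rank_pool(frames), approx_rank_pool(frames[::-1])

    # Usage: a dummy 16-frame RGB clip; a depth clip (H x W x 1) would be
    # pooled the same way to obtain DDIs instead of VDIs.
    clip = [np.random.randint(0, 256, (112, 112, 3), dtype=np.uint8)
            for _ in range(16)]
    vdi_fwd, vdi_bwd = bidirectional_dynamic_images(clip)

The forward and backward images compress a whole sequence into two still images, which is what lets a standard image ConvNet consume long RGB and depth sequences.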
ISSN: 1380-7501
eISSN: 1573-7721
DOI: 10.1007/s11042-019-08576-z