A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes
Saved in:
Published in: The Journal of supercomputing 2022-02, Vol. 78 (2), p. 2873-2908
Main authors: , , , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Deep learning-based solutions for computer vision have made life easier for humans. Video data contain a great deal of hidden information and patterns that can be used for Human Action Recognition (HAR). HAR applies to many areas, such as behavior analysis, intelligent video surveillance, and robotic vision. Occlusion, viewpoint variation, and illumination are some of the issues that make the HAR task more difficult. Some action classes contain similar or overlapping sub-actions; this, among many other problems, contributes most to misclassification. Traditional hand-engineered and machine learning-based solutions lack the ability to handle overlapping actions. In this paper, we propose a deep learning-based spatiotemporal HAR framework for overlapping human actions in long videos. Transfer learning techniques are used for deep feature extraction: fine-tuned pre-trained CNN models learn the spatial relationships at the frame level. An optimized deep autoencoder is used to squeeze the high-dimensional deep features, and an RNN with LSTM units learns the long-term temporal relationships. An iterative module added at the end fine-tunes the trained model on new videos so that it learns and adapts to changes. Our proposed framework achieves state-of-the-art performance in spatiotemporal HAR for overlapping human actions in long visual data streams in non-stationary surveillance environments.
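The data flow the abstract describes (per-frame deep features from a fine-tuned pretrained CNN, an autoencoder bottleneck that compresses them, and an LSTM that models the sequence) can be sketched with untrained random weights. This is purely an illustrative assumption of the pipeline's shapes, not the authors' implementation; all dimensions (2048-D frame features, a 256-D bottleneck, 128 hidden units, 10 action classes) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, feat_dim, bottleneck, hidden, num_classes = 30, 2048, 256, 128, 10

# 1) Stand-in for per-frame deep features that a fine-tuned pretrained CNN
#    would extract from the video frames.
frame_feats = rng.standard_normal((num_frames, feat_dim))

# 2) Autoencoder bottleneck: one random projection plus ReLU stands in for
#    the trained encoder that squeezes 2048-D features down to 256-D.
W_enc = rng.standard_normal((feat_dim, bottleneck)) / np.sqrt(feat_dim)
compressed = np.maximum(frame_feats @ W_enc, 0.0)

# 3) Minimal LSTM cell run over the compressed frame sequence
#    (random, untrained weights; gates stacked as [i, f, o, g]).
Wx = rng.standard_normal((bottleneck, 4 * hidden)) / np.sqrt(bottleneck)
Wh = rng.standard_normal((hidden, 4 * hidden)) / np.sqrt(hidden)
b = np.zeros(4 * hidden)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
h = c = np.zeros(hidden)
for x in compressed:
    z = x @ Wx + h @ Wh + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g          # cell state carries long-term information
    h = o * np.tanh(c)         # hidden state summarizes the clip so far

# 4) Softmax classifier over the final hidden state -> action distribution.
W_out = rng.standard_normal((hidden, num_classes)) / np.sqrt(hidden)
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # → (10,)
```

In a real system, step 1 would be the output of a CNN such as a fine-tuned ImageNet model, and steps 2-3 would use trained weights; the sketch only shows how the three stages connect and which tensor shapes flow between them.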
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-021-03957-4