Human action recognition in videos based on spatiotemporal features and bag-of-poses

Bibliographic Details
Published in: Applied Soft Computing, 2020-10, Vol. 95, p. 106513, Article 106513
Main authors: Varges da Silva, Murilo; Nilceu Marana, Aparecido
Format: Article
Language: English
Online access: Full text
Description
Abstract: Currently, there are a large number of methods that use 2D poses to represent and recognize human action in videos. Most of these methods extract features (e.g., angles and trajectories) computed from raw 2D poses, based on the straight-line segments that form the body parts in a 2D pose model. In our work, we propose a new method of representing 2D poses. Instead of using the straight-line segments directly, the 2D pose is first converted to a parameter space in which each segment is mapped to a point. Spatiotemporal features are then extracted from the parameter space and encoded using a Bag-of-Poses approach, and finally used for human action recognition in the video. Experiments on two well-known public datasets, Weizmann and KTH, showed that the proposed method using 2D poses encoded in the parameter space can improve recognition rates, obtaining accuracy competitive with state-of-the-art methods.

Highlights:
• We propose a new way to represent 2D poses using the straight-line parameter space: each straight-line segment obtained from a 2D pose is mapped to a point in the parameter space.
• We propose a new set of spatiotemporal descriptors based on 2D poses, combining spatial information (e.g., the angles formed between parts of the human skeleton in each frame of the video) and temporal information (e.g., the trajectory of each part of the human skeleton over the frames of the video).
• We propose a new Bag-of-Poses approach to encode the spatiotemporal descriptors into high-level features.
• Our descriptors are robust, obtaining good results when compared with important human action descriptors found in the literature.
• Our descriptors are lightweight and fast to compute compared to other methods.
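As an illustration of the pipeline the abstract describes, the following Python sketch maps each body-part segment to a point in a line parameter space and then encodes per-frame descriptors as a Bag-of-Poses histogram over a k-means codebook. It is not the authors' code: the normal parameterization rho = x cos(theta) + y sin(theta) is one common choice assumed here, and all names (segment_to_point, build_bag_of_poses, n_words) are hypothetical.

import numpy as np
from sklearn.cluster import KMeans


def segment_to_point(p1, p2):
    # Map the line through segment endpoints p1, p2 to a single point
    # (rho, theta) via the normal parameterization (an assumption; the
    # paper only states that each segment becomes a point). All collinear
    # segments map to the same point.
    (x1, y1), (x2, y2) = p1, p2
    theta = np.arctan2(x2 - x1, y1 - y2)           # angle of the line's normal
    rho = x1 * np.cos(theta) + y1 * np.sin(theta)  # signed distance from origin
    return np.array([rho, theta])


def build_bag_of_poses(train_videos, n_words=64, seed=0):
    # Cluster all per-frame descriptors into a codebook (k-means), then
    # return an encoder that turns a video into a normalized histogram
    # of codeword assignments -- a fixed-length "bag of poses".
    stacked = np.vstack(train_videos)              # (total_frames, D)
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(stacked)

    def encode(video_frames):
        words = codebook.predict(video_frames)     # one codeword per frame
        hist = np.bincount(words, minlength=n_words).astype(float)
        return hist / max(hist.sum(), 1.0)
    return encode


# Toy usage: pretend each "video" is 30 frames, each frame described by the
# (rho, theta) points of 4 body segments flattened into an 8-D vector.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(30, 8)) for _ in range(5)]
encode = build_bag_of_poses(videos, n_words=16)
feature = encode(videos[0])  # fixed-length feature for a classifier (e.g., SVM)

The histogram output is a fixed-length vector regardless of video length, which is what makes the encoding usable with standard classifiers; the codebook size n_words trades descriptor granularity against sparsity.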
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2020.106513