Comparison of motion-based approaches for multi-modal action and gesture recognition from RGB-D
Format: Dissertation
Language: English
Abstract: The field of automatic action and gesture recognition has grown in interest over the last few
years. Action recognition can be understood as the automatic classification of generic human actions or
activities, such as walking, reading, or jumping, while gesture recognition focuses on the analysis of
more specific movements, usually of the upper body, which carry meaning on their own, such as waving,
saluting, or negating. Interest in the domain comes mainly from its many applications, which
include human-computer interaction, ambient assisted living systems, health care monitoring systems,
surveillance, communications, and entertainment. This domain shares many similarities with
object recognition from still images; nevertheless, it has a special characteristic that turns it
into a very challenging task: the temporal evolution of actions and gestures. The current scenario in
the research community is a competition to find out how best to deal with
this extra dimensionality. The project therefore starts with an exhaustive state-of-the-art analysis,
in which the most common approaches for dealing with time are summarized. Hand-crafted features
rely on the extension of 2D descriptors, such as HoG or SIFT, to a third dimension (time), as well as on
descriptors based on motion features, such as optical flow or scene flow. Meanwhile, deep
learning models can be categorized into four non-mutually exclusive categories according to how
they deal with time: 2D CNNs that perform recognition on still frames of a video, averaging the
per-frame results; 2D CNNs applied to motion features; 3D CNNs able to compute convolutions
over two spatial dimensions and one temporal dimension; and neural networks that can model temporal
evolution, such as RNNs and LSTMs. After reviewing the literature, a selection of
these methods is tested to determine the direction in which future research in the domain should point.
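The 3D-CNN category above rests on convolving a single kernel over two spatial dimensions and one temporal dimension at once, so that a filter can respond to motion patterns and not only to appearance. A minimal NumPy sketch of that operation (all names, shapes, and the toy kernel are illustrative assumptions, not taken from the dissertation):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width).

    volume: array of shape (T, H, W), e.g. a stack of grayscale frames.
    kernel: array of shape (kt, kh, kw) spanning kt consecutive frames.
    """
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Each output value mixes a spatio-temporal neighbourhood,
                # which is what lets the filter detect motion, not just shape.
                out[t, y, x] = np.sum(
                    volume[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# Toy example: a temporal-difference kernel responds to change across frames.
video = np.zeros((4, 5, 5))
video[2:, 1:4, 1:4] = 1.0                     # an object "appears" at frame 2
kernel = np.zeros((2, 1, 1))
kernel[0, 0, 0], kernel[1, 0, 0] = -1.0, 1.0  # frame t+1 minus frame t
response = conv3d_valid(video, kernel)        # shape (3, 5, 5)
```

Here the filter output is zero wherever consecutive frames are identical and non-zero only where the appearing object changes the scene, which is the kind of spatio-temporal selectivity a learned 3D kernel can acquire.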
Additionally, the recent increase in the availability of depth sensors (Microsoft's Kinect V1 and V2)
allows the exploration of multi-modal techniques that take advantage of multiple data sources (RGB
and depth). The domain's background has shown that many algorithms can benefit from this extra
modality, either on its own or in combination with classical RGB. For these reasons, techniques that
rely on multi-modal data must be tested as well; to that end, one of the selected algorithms has been modified.
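One common way to let an algorithm benefit from both modalities is late fusion: classify the RGB stream and the depth stream independently, then combine the per-class scores. A sketch under assumed names and toy logits (this is a generic technique, not necessarily the modification made in the dissertation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical class logits from two independently trained classifiers,
# one operating on RGB frames and one on depth maps (values are made up).
rgb_logits = np.array([2.0, 0.5, 0.1])
depth_logits = np.array([1.5, 1.0, 0.2])

# Late fusion: average the per-modality probability distributions,
# so each modality contributes equally to the final decision.
fused = 0.5 * (softmax(rgb_logits) + softmax(depth_logits))
prediction = int(np.argmax(fused))
```

A weighted average (e.g. trusting RGB more in good lighting) is a straightforward variant of the same scheme.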