Action recognition using global spatio-temporal features derived from sparse representations



Bibliographic Details
Published in: Computer vision and image understanding, 2014-06, Vol. 123, pp. 1-13
Authors: Somasundaram, Guruprasad; Cherian, Anoop; Morellas, Vassilios; Papanikolopoulos, Nikolaos
Format: Article
Language: English
Description
Summary:

• A spatiotemporal feature detector for human actions based on sparse representations.
• Features are obtained by ranking the most salient regions.
• Descriptors are used in a bag-of-features classification framework.
• Performance is evaluated on three standard human action datasets.
• We report very competitive performance using the proposed approach.

Recognizing actions in video is one of the important challenges in computer vision, with applications in surveillance, diagnosis of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video demands orders of magnitude more computational and storage resources. One way to alleviate this difficulty is to focus computation on informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure that scores saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning allows saliency to be assessed in a global setting (over the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found by our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence that can be used in a classification setting. Experiments on several benchmark datasets for video-based action classification demonstrate that our approach performs competitively with the state of the art.
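To make the saliency-scoring idea concrete, the following is a minimal Python sketch, not the authors' implementation: it learns a dictionary over a video's own spatio-temporal blocks and ranks each block by its sparse reconstruction residual, so the least self-similar blocks score highest. The block size, stride, OMP sparsity level, and the helper names extract_blocks and rank_salient_blocks are all illustrative assumptions, and scikit-learn stands in for whatever dictionary-learning solver the paper actually uses.

# Sketch of sparse-coding-based saliency scoring: blocks the learned
# dictionary reconstructs poorly are the least self-similar, hence most
# salient. All parameter values below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

def extract_blocks(video, block=(8, 16, 16), stride=(8, 16, 16)):
    """Slice a (T, H, W) grayscale video into flattened spatio-temporal blocks."""
    T, H, W = video.shape
    bt, bh, bw = block
    st, sh, sw = stride
    blocks, positions = [], []
    for t in range(0, T - bt + 1, st):
        for y in range(0, H - bh + 1, sh):
            for x in range(0, W - bw + 1, sw):
                blocks.append(video[t:t+bt, y:y+bh, x:x+bw].ravel())
                positions.append((t, y, x))
    return np.asarray(blocks, dtype=np.float64), positions

def rank_salient_blocks(video, n_atoms=64, sparsity=5, keep_frac=0.1):
    """Return positions of the least self-similar (most salient) blocks."""
    X, positions = extract_blocks(video)
    X -= X.mean(axis=1, keepdims=True)  # remove per-block DC offset
    # Learn a global dictionary from the video's own blocks.
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                       batch_size=64, random_state=0)
    D = dico.fit(X).components_
    # Sparse-code every block against the learned dictionary (OMP).
    codes = sparse_encode(X, D, algorithm='omp', n_nonzero_coefs=sparsity)
    # Saliency score = reconstruction residual: high residual means the
    # block is poorly explained by the rest of the video.
    residuals = np.linalg.norm(X - codes @ D, axis=1)
    k = max(1, int(keep_frac * len(positions)))
    top = np.argsort(residuals)[::-1][:k]
    return [positions[i] for i in top]

# Example: score a random 40-frame clip and keep the top 10% of blocks.
clip = np.random.rand(40, 64, 64)
salient = rank_salient_blocks(clip)

In the pipeline the abstract describes, descriptors such as HOG or region covariance would then be computed only over the retained blocks and pooled into a bag-of-features histogram for classification.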
ISSN: 1077-3142
eISSN: 1090-235X
DOI: 10.1016/j.cviu.2014.01.002