First-Person Daily Activity Recognition With Manipulated Object Proposals and Non-Linear Feature Fusion

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2946-2955, Oct. 2018
Authors: Wang, Meng, Luo, Changzhi, Ni, Bingbing, Yuan, Jun, Wang, Jianfeng, Yan, Shuicheng
Format: Article
Language: English
Description
Abstract: Most previous work on first-person video recognition focuses on measuring the similarity of different actions using low-level features of the objects humans interact with. However, due to noisy camera motion and frequent changes in viewpoint and scale, these methods fail to capture and model highly discriminative object features. In this paper, we propose a novel pipeline for first-person daily activity recognition. Our object feature extraction pipeline is inspired by the recent success of object hypotheses and deep convolutional neural network (CNN)-based detection frameworks. Our key contribution is a simple yet effective manipulated object proposal generation scheme. This scheme leverages motion cues, such as motion boundary and motion magnitude (in contrast, most previous methods treat camera motion as "noise"), to generate a more compact and discriminative set of object proposals that are closely related to the objects being manipulated. We then learn more discriminative object detectors from these manipulated object proposals using a region-based CNN. In addition, we develop a non-linear feature fusion scheme that better combines object and motion features. Experiments show that the proposed framework significantly outperforms state-of-the-art methods on a challenging first-person daily activity benchmark.
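
The abstract names motion boundary and motion magnitude as the cues behind the manipulated object proposal scheme but does not spell out the procedure. As a rough illustration only, not the authors' method, the following Python sketch ranks generic proposal boxes by residual motion magnitude after subtracting a crude median-flow ego-motion estimate; the flow representation, the box format, and the median proxy are all assumptions made for this sketch.

    import numpy as np

    def score_proposals_by_motion(flow, boxes, top_k=100):
        """Rank candidate boxes by residual motion magnitude.

        flow  : H x W x 2 dense optical-flow field (assumed precomputed).
        boxes : N x 4 array of (x1, y1, x2, y2) generic object proposals.
        Returns the top_k boxes most likely to contain a manipulated object.
        """
        # Per-pixel motion magnitude.
        mag = np.linalg.norm(flow, axis=2)
        # Crude ego-motion proxy: the median magnitude over the whole frame.
        # Subtracting it makes independently moving (manipulated) objects
        # stand out from camera-induced motion.
        residual = np.clip(mag - np.median(mag), 0.0, None)
        scores = []
        for x1, y1, x2, y2 in boxes.astype(int):
            region = residual[y1:y2, x1:x2]
            scores.append(region.mean() if region.size else 0.0)
        order = np.argsort(scores)[::-1][:top_k]
        return boxes[order]

In the paper's pipeline the filtered proposals would then feed a region-based CNN detector; here they are simply returned.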
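
Likewise, the non-linear feature fusion scheme is only named in the abstract. One common non-linear alternative to plain concatenation, shown below purely as an illustrative stand-in for the paper's scheme, adds signed square-root/L2 normalization and a multiplicative object-motion interaction term; the shared dimensionality of the two feature vectors is an assumption.

    import numpy as np

    def fuse_features(obj_feat, mot_feat):
        """Non-linear fusion of per-video object and motion descriptors.

        Both inputs are assumed to be 1-D vectors of equal length (in
        practice each CNN branch would be projected to a common size).
        """
        def normalize(v):
            v = np.sign(v) * np.sqrt(np.abs(v))    # signed square root
            return v / (np.linalg.norm(v) + 1e-8)  # L2 normalization
        o, m = normalize(obj_feat), normalize(mot_feat)
        # The element-wise product exposes multiplicative object-motion
        # interactions that a linear classifier applied to a plain
        # concatenation could not model.
        return np.concatenate([o, m, o * m])

A linear classifier (e.g., an SVM) trained on the fused vector then benefits from non-linear interactions between the two modalities.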
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2017.2716819