Learning Higher-order Object Interactions for Keypoint-based Video Understanding
Saved in:

Format: Article
Language: English
Online access: Order full text
Abstract: Action recognition is an important problem that requires identifying actions in video by learning complex interactions across scene actors and objects. However, modern deep learning-based networks often require significant computation, and may capture scene context using various modalities that further increase compute costs. Efficient methods, such as those used for AR/VR, often use only human-keypoint information but suffer from a loss of scene context that hurts accuracy. In this paper, we describe an action-localization method, KeyNet, that uses only keypoint data for tracking and action recognition. Specifically, KeyNet introduces the use of object-based keypoint information to capture context in the scene. Our method illustrates how to build a structured intermediate representation that allows modeling higher-order interactions in the scene from object and human keypoints without using any RGB information. We find that KeyNet is able to track and classify human actions at just 5 FPS. More importantly, we demonstrate that object keypoints can be modeled to recover the loss in context from using only keypoint information on the AVA action and Kinetics datasets.
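The abstract's core idea, pooling human joints and object keypoints into one structured set so that higher-order interactions can be modeled without RGB, can be sketched as below. This is a minimal illustrative sketch, not KeyNet's actual implementation: all function names, feature choices, and coordinate shapes are assumptions for demonstration.

```python
# Hypothetical sketch: combine human and object keypoints into one typed
# token set, then derive pairwise relation features as a stand-in for the
# "structured intermediate representation" the abstract describes.
from itertools import combinations

def keypoint_tokens(human_joints, object_points):
    """Tag each 2D keypoint with its source so interactions can be typed."""
    tokens = [(x, y, "human") for (x, y) in human_joints]
    tokens += [(x, y, "object") for (x, y) in object_points]
    return tokens

def pairwise_relations(tokens):
    """Relative offsets for every keypoint pair; a real model would feed
    such relations into a learned interaction module."""
    relations = []
    for (x1, y1, t1), (x2, y2, t2) in combinations(tokens, 2):
        relations.append({"offset": (x2 - x1, y2 - y1), "pair": (t1, t2)})
    return relations

# Two human joints and one object keypoint, all in normalized coordinates.
toks = keypoint_tokens([(0.2, 0.5), (0.3, 0.6)], [(0.8, 0.5)])
rels = pairwise_relations(toks)
print(len(toks), len(rels))  # 3 tokens yield 3 pairwise relations
```

The `"pair"` type labels let a downstream classifier distinguish human–human from human–object interactions, which is where the abstract claims the recovered scene context comes from.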
DOI: 10.48550/arxiv.2305.09539