DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
Main authors: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | Human action recognition has recently become one of the most popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions of video action recognition with competitive results. However, these methods suffer from fundamental limitations such as a lack of robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition. The method takes a simple but novel Transformer-based perspective to understand the correct order of action sequences. The contributions of this work are therefore three-fold. First, we introduce the problem of ordered temporal learning to action recognition. Second, a new Directed Attention mechanism is introduced to understand and attend to human actions in the correct order. Third, we introduce a conditional dependency in action sequence modeling that covers both orders and classes. The proposed approach consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and Something-Something-V2. |
DOI: | 10.48550/arxiv.2203.10233 |
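
The abstract names a Directed Attention mechanism but does not spell out its formulation. As a rough illustration only, below is a minimal, hypothetical sketch of a direction-aware temporal attention layer in PyTorch: it adds a learnable per-head bias that distinguishes whether a key frame precedes, coincides with, or follows the query frame, so the attention can become sensitive to frame order. The class name `DirectedTemporalAttention` and all of its parameters are assumptions made for illustration; this is not the authors' DirecFormer implementation.

```python
# Hypothetical sketch of a direction-aware temporal attention block.
# NOT the authors' exact DirecFormer formulation; it only illustrates
# the general idea of biasing attention by the temporal order of frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectedTemporalAttention(nn.Module):
    """Self-attention over frame tokens with learnable directional biases.

    A scalar bias is added to every attention logit depending on whether
    the key frame comes before, at, or after the query frame, so the
    layer can distinguish forward from backward temporal relations.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head for each relation: before/same/after.
        self.dir_bias = nn.Parameter(torch.zeros(num_heads, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) -- one token per frame.
        B, T, D = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, H, T, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, H, T, T)

        # rel[i, j] = 0 if key frame j precedes query frame i,
        #             1 if they are the same frame, 2 if j follows i.
        idx = torch.arange(T, device=x.device)
        rel = (idx[None, :] - idx[:, None]).sign().long() + 1  # (T, T)
        attn = attn + self.dir_bias[:, rel]  # (H, T, T) broadcast over batch

        out = F.softmax(attn, dim=-1) @ v  # (B, H, T, head_dim)
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.proj(out)


# Quick shape check on random frame embeddings.
x = torch.randn(2, 16, 256)  # 2 clips, 16 frames, 256-dim tokens
print(DirectedTemporalAttention(256)(x).shape)  # torch.Size([2, 16, 256])
```

With all directional biases at zero, the layer reduces to standard temporal self-attention; the biases give it a cheap way to break the symmetry between forward and reversed frame orders, which is the kind of order sensitivity the abstract's Directed Attention is aimed at.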