Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition

This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency. Our investigation commences with an examination of spatial redundancy , which refers to the observation that the most informative region in e...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on pattern analysis and machine intelligence 2024-12, p.1-18
Hauptverfasser: Wang, Yulin, Zhang, Haoji, Yue, Yang, Song, Shiji, Deng, Chao, Feng, Junlan, Huang, Gao
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency. Our investigation commences with an examination of spatial redundancy , which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. In specific, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the temporal and sample- wise redundancies, i.e. , allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively "easier" videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample- wise dynamic computation, while it preserves the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it is compatible with off-the-shelf backbone models ( e.g. , TSM and X3D), which can be readily deployed as our feature extractor, yielding a significantly improved computational efficiency. Empirically, extensive experiments based on seven widely-used benchmark datasets ( i.e. , ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios ( i.e. , fine-grained diving action classification, Alzheimer's and Parkinson's diseases diagnosis with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines. Code and pre-trained models are available at htt
ISSN:0162-8828
2160-9292
DOI:10.1109/TPAMI.2024.3514654