Temporal Action Localization With Coarse-to-Fine Network

Precisely localizing temporal intervals for each action segment in long raw videos is essential challenge in practical video content analysis (e.g., activity detection or video caption generation). Most of previous works often neglect the hierarchical action granularity and eventually fail to identi...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE access 2022, Vol.10, p.96378-96387
Hauptverfasser: Zhang, Min, Hu, Haiyang, Li, Zhongjin
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Precisely localizing temporal intervals for each action segment in long raw videos is essential challenge in practical video content analysis (e.g., activity detection or video caption generation). Most of previous works often neglect the hierarchical action granularity and eventually fail to identify precise action boundaries. (e.g., embracing approaching or turning a screw in mechanical maintenance). In this paper, we introduce a simple yet efficient coarse-to-fine network (CFNet) to solve the challenging issue of temporal action localization by progressively refining action boundary at multiple action granularities. The proposed CFNet is mainly composed of three components: a coarse proposal module (CPM) to generate coarse action candidates, a fusion block (FB) to enhance feature representation by fusing the coarse candidate features and corresponding features of raw input frames, and a boundary transformer module (BTM) to further refine action boundaries. Specifically, CPM exploits framewise, matching and gated actionness curves to complement each other for coarse candidate generation at different levels, while FB is devised to enrich feature representation by fusing the last feature map of CPM and corresponding raw frame input. Finally, BTM learns long-term temporal dependency with a transformer structure to further refine action boundaries at a finer granularity. Thus, the fine-grained action intervals can be incrementally obtained. Compared with previous state-of-the-art techniques, the proposed coarse-to-fine network can asymptotically approach fine-grained action boundary. Comprehensive experiments are conducted on both publicly available THUMOS14 and ActivityNet-v1.3 datasets, and show the outstanding improvements of our method when compared with the prior methods on various video action parsing tasks.
ISSN:2169-3536
2169-3536
DOI:10.1109/ACCESS.2022.3205594