Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection

Bibliographic details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-05, Vol. 32 (5), p. 2962-2975
Main authors: Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Lu, Weichen; Suo, Xinhua
Format: Article
Language: English
Description
Abstract: Temporal action detection is a challenging task in video understanding because complex backgrounds and rich action content make it hard to generate high-quality temporal proposals in untrimmed videos. Capsule networks avoid some limitations of convolutional neural networks, such as the invariance introduced by pooling, and can therefore better model the temporal relations needed for temporal action detection. However, capsule networks are extremely expensive to compute, which makes them difficult to apply to this task. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism for 3D convolutional dynamic routing, named U-BlockConvCaps. Based on U-BlockConvCaps, the authors further build a Capsules Boundary Network (CapsBoundNet) for dense temporal action proposal generation. Specifically, the first module is a single 1D convolutional layer that fuses the two-stream (RGB and optical flow) video features. The sampling module then processes the fused features to generate 2D start-end action proposal feature maps. Next, the multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing processes the proposal feature maps. Finally, the feature maps generated by CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help to capture boundary and intersection-over-union features. The work improves the dynamic routing algorithm of capsule networks and, according to the authors, extends capsule networks to the temporal action detection task for the first time in the literature.
Experimental results on the THUMOS14 benchmark show that CapsBoundNet outperforms state-of-the-art methods: mAP at tIoU thresholds of 0.3, 0.4, and 0.5 improves from 63.6% to 70.0%, from 57.8% to 63.1%, and from 51.3% to 52.9%, respectively. The method also achieves competitive results on the ActivityNet1.3 action detection dataset.
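The tIoU thresholds quoted in the abstract refer to the temporal intersection over union between a predicted action segment and a ground-truth segment. The following is a minimal sketch of that measure, assuming segments are given as (start, end) pairs in seconds. The function name `temporal_iou` and the example segments are illustrative assumptions and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is clamped at zero for disjoint segments.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical segments, chosen only for illustration.
prediction = (2.0, 6.0)
ground_truth = (3.0, 7.0)
score = temporal_iou(prediction, ground_truth)

# A prediction counts as a match at threshold t when its tIoU is at least t.
matched_at = {t: score >= t for t in (0.3, 0.4, 0.5)}
```

In this example the segments overlap for 3 seconds out of a combined span of 5 seconds, so the prediction would count as a match at all three thresholds. Raising the threshold demands tighter boundary localization.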
ISSN: 1051-8215, 1558-2205
DOI:10.1109/TCSVT.2021.3104226