TSCANet: a two-stream context aggregation network for weakly-supervised temporal action localization

Weakly supervised temporal action localization classifies and localizes actions in uncropped videos by using only video-level labels. Many current methods employ feature extractors initially intended for post-cropped video action classification. The accuracy of localization decreases when feature ex...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	The Journal of supercomputing 2025-01, Vol.81 (1), Article 311
Hauptverfasser:	Zhang, Haiping, Lin, Haixiang, Wang, Dongjing, Xu, Dongyang, Zhou, Fuxing, Guan, Liming, Yu, Dongjing, Fang, Xujian
Format:	Artikel
Sprache:	eng
Schlagworte:	Context Feature extraction Localization Modules Optical flow (image analysis) Video data
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Weakly supervised temporal action localization classifies and localizes actions in uncropped videos by using only video-level labels. Many current methods employ feature extractors initially intended for post-cropped video action classification. The accuracy of localization decreases when feature extractors of this type are used, because they may introduce redundant information into the action localization task. To overcome the aforementioned constraints, we propose a WSTAL technique based on the two-stream context aggregation network (TSCANet), which consists of two main modules: a multistage temporal feature aggregation module (MSTFA) and a feature alignment module (FA). The MSTFA enables TSCANet to rapidly expand the receptive field and acquire temporal dependencies between long-distance segments by stacking dilated convolutional layers. Therefore, MSTFA allows the model to better aggregate temporal information in optical flow features to reduce redundant information in the original features. To avoid inconsistencies between the enhanced optical flow and RGB flow features, this study designed an FA to calibrate RGB features using optimized optical flow features through a mutual learning approach. On THUMOS14 and ActivityNet datasets, many comparative tests are carried out, and an improved localization performance is attained. In particular, localization at low t-IoU thresholds outperforms many of the existing WSTAL methods.
ISSN:	0920-8542 1573-0484
DOI:	10.1007/s11227-024-06810-6