Local and Global Context Cooperation for Temporal Action Detection
| Published in: | Multimedia Systems, 2024-12, Vol. 30 (6), Article 334 |
|---|---|
| Main authors: | , |
| Format: | Article |
| Language: | English |
| Keywords: | |
| Online access: | Full text |
| Abstract: | Temporal action detection (TAD) is a fundamental task in video understanding: it aims to locate the start and end boundaries of action instances and identify their categories within untrimmed videos. Distinguishing between similar actions in a video remains difficult. Fine-grained temporal localization and action classification require additional temporal cues when building visual representations of similar actions. To address this issue, we propose local and global context cooperation (LGCC), a method that constructs discriminative visual representations by combining short-term, medium-term, and long-term dependencies. LGCC comprises two main components: a local relation module and a global relation module. Specifically, we design a novel short-term and medium-term temporal context aggregation module (SMTCA) that captures local context cues within an action instance to build short-term dependencies and uses different dilation rates to widen the scope of information collection and establish medium-term dependencies. The local relation module stacks multiple SMTCAs to gather more temporal cues for fine-grained modeling. The global relation module employs multi-head self-attention to capture complex long-term context dependencies. We also design the LRGR module, which combines the local and global relation modules to produce more expressive temporal features and improve action classification and boundary detection. Extensive experiments are conducted on the THUMOS14, ActivityNet1.3, and EPIC-Kitchens 100 datasets. LGCC achieves an average mAP of 68.9% on THUMOS14 and 36.6% on ActivityNet1.3; on EPIC-Kitchens 100, it reaches average mAPs of 25.2% and 23.3% on the verb and noun tasks, respectively. These results show that LGCC achieves state-of-the-art performance. (An illustrative sketch of the local/global design follows this record.) |
| ISSN: | 0942-4962, 1432-1882 |
| DOI: | 10.1007/s00530-024-01511-9 |
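
The abstract above describes two cooperating components: a local relation module built from dilated temporal convolutions (SMTCA) and a global relation module based on multi-head self-attention, combined by the LRGR module. Below is a minimal PyTorch-style sketch of how such local/global cooperation might be wired. The module names, channel sizes, dilation rates, residual connections, and concatenation-based fusion are assumptions made for illustration; they are not the authors' implementation.

```python
import torch
import torch.nn as nn


class SMTCABlock(nn.Module):
    """Hypothetical short/medium-term context aggregation: parallel 1D
    temporal convolutions with different dilation rates, summed together."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x):                      # x: (B, C, T)
        out = sum(branch(x) for branch in self.branches)
        return torch.relu(self.norm(out)) + x  # residual connection (assumed)


class GlobalRelation(nn.Module):
    """Hypothetical long-term context module: multi-head self-attention
    over the full temporal sequence."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, T)
        seq = x.transpose(1, 2)                # (B, T, C) for attention
        attn_out, _ = self.attn(seq, seq, seq)
        return self.norm(seq + attn_out).transpose(1, 2)


class LocalGlobalCooperation(nn.Module):
    """Illustrative fusion of local (stacked SMTCA) and global relation
    features into one temporal representation."""
    def __init__(self, channels, num_local_blocks=2):
        super().__init__()
        self.local = nn.Sequential(
            *[SMTCABlock(channels) for _ in range(num_local_blocks)])
        self.global_rel = GlobalRelation(channels)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T) snippet features
        local_feat = self.local(x)
        global_feat = self.global_rel(x)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))


# Usage sketch: a batch of 2 videos, 64 snippets of 256-dim features each.
feats = torch.randn(2, 256, 64)
model = LocalGlobalCooperation(channels=256)
print(model(feats).shape)                      # torch.Size([2, 256, 64])
```

The fused features would then feed a classification head and a boundary-regression head in a detection pipeline; those heads, and the actual LRGR fusion strategy, are beyond what the abstract specifies.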