Efficient spatiotemporal context modeling for action recognition



Bibliographic Details
Published in: Neurocomputing (Amsterdam), 2023-08, Vol. 545, p. 126289, Article 126289
Main Authors: Cao, Congqi, Lu, Yue, Zhang, Yifan, Jiang, Dongmei, Zhang, Yanning
Format: Article
Language: English
Online Access: Full text
Description
Abstract:

Highlights:
• We extend 2D criss-cross attention to 3D, which gives it the ability to model sparse context in spatiotemporal space. Compared to non-local attention, the complexity of CCA-3D for spatiotemporal context modeling is greatly reduced, and hence the computational and memory burden is much lower.
• We propose to stack CCA-3Ds and devise a novel recurrent structure that leverages appearance information for dense spatiotemporal context modeling. The proposed RCCA-3D structure addresses the inability of the original RCCA-2D structure to model the entire spatiotemporal context, and it is more suitable for action recognition than a directly extended 3D version of RCCA-2D.
• We conduct extensive experiments with three backbones on five RGB-based and skeleton-based datasets to comprehensively verify the effectiveness of our method. All backbones equipped with RCCA-3D achieve improved, leading performance on these datasets.

Contextual information is essential for action recognition. However, local operations have difficulty relating two distant elements, while directly computing the dense relations between all pairs of points incurs a huge computation and memory burden. Inspired by the recurrent 2D criss-cross attention (RCCA-2D) used in image segmentation, we propose a recurrent 3D criss-cross attention (RCCA-3D) that factorizes the global relation map into sparse relation maps to model long-range spatiotemporal context at minor cost for video-based action recognition. Specifically, we first propose a 3D criss-cross attention (CCA-3D) module. Compared with CCA-2D, which operates only in space, it captures the spatiotemporal relationships between points lying on the same line along the width, height, and time directions. However, simply replacing the two CCA-2Ds in RCCA-2D with our CCA-3Ds cannot model the full spatiotemporal context in videos. Therefore, we further duplicate the CCA-3D with a recurrent mechanism that propagates relations from the points on a line to a plane and finally to the whole spatiotemporal space. To adapt RCCA-3D to action recognition, we propose a novel recurrent structure rather than directly extending the original 2D structure to 3D. In the experiments, we thoroughly analyze different RCCA-3D structures, verifying that the proposed structure is more suitable for action recognition. We also compare RCCA-3D with non-local attention, showing that RCCA-3D requires 25% fewer parameters and 30% fewer FLOPs while achieving even higher accuracy.
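The abstract describes the CCA-3D mechanism only in prose, so a minimal PyTorch-style sketch may help make the sparsity argument concrete. The code below is our own illustration, not the authors' released implementation: the module name `CCA3D`, the channel-reduction factor, and the simplified handling of the duplicated self-position are all assumptions. What it demonstrates is the core idea that each position (t, h, w) attends only to the T + H + W - 2 positions sharing its temporal, vertical, or horizontal line, rather than the full T×H×W set used by non-local attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCA3D(nn.Module):
    """Sketch of 3D criss-cross attention (CCA-3D).

    Each position attends only to positions that share its line along
    the time, height, or width axis; relations to the rest of the
    volume are filled in by recurrent application (the RCCA-3D idea).
    """

    def __init__(self, channels, reduction=8):  # reduction factor is an assumption
        super().__init__()
        inter = channels // reduction
        self.query = nn.Conv3d(channels, inter, kernel_size=1)
        self.key = nn.Conv3d(channels, inter, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def forward(self, x):
        b, c, t, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Affinities between each query and the keys on its three lines.
        e_t = torch.einsum('bcthw,bcshw->bthws', q, k)  # same (h, w), all times
        e_h = torch.einsum('bcthw,bctjw->bthwj', q, k)  # same (t, w), all rows
        e_w = torch.einsum('bcthw,bcthi->bthwi', q, k)  # same (t, h), all cols

        # Joint softmax over the criss-cross set. For simplicity the self
        # position is counted once per axis here; a faithful implementation
        # would mask the duplicates.
        attn = F.softmax(torch.cat([e_t, e_h, e_w], dim=-1), dim=-1)
        a_t, a_h, a_w = attn.split([t, h, w], dim=-1)

        # Aggregate values from the three lines and add back residually.
        out = (torch.einsum('bthws,bcshw->bcthw', a_t, v)
               + torch.einsum('bthwj,bctjw->bcthw', a_h, v)
               + torch.einsum('bthwi,bcthi->bcthw', a_w, v))
        return x + self.gamma * out

# Usage sketch: recurrent shared-weight application, which is how relations
# spread from a line to a plane and then to the whole volume. The recurrence
# count of 3 is an assumption for illustration, not the paper's setting.
cca = CCA3D(channels=64)
feat = torch.randn(2, 64, 8, 14, 14)  # (batch, channels, T, H, W)
for _ in range(3):
    feat = cca(feat)
print(feat.shape)  # torch.Size([2, 64, 8, 14, 14])
```

Attending over roughly T + H + W entries per position instead of T·H·W is what drives the FLOP and memory reduction relative to non-local attention quoted in the abstract.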
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2023.126289