Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention

Bibliographic Details
Published in:	Signal, image and video processing, 2023-06, Vol.17 (4), p.1173-1180
Main authors:	Cao, Haiwen; Wu, Chunlei; Lu, Jing; Wu, Jie; Wang, Leiquan
Format:	Article
Language:	English
Subjects:	Activity recognition; Computer Imaging; Computer Science; Constraint modelling; Image Processing and Computer Vision; Multimedia Information Systems; Optical flow (image analysis); Original Paper; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Streams; Vision
Online access:	Full text

Description
Summary:	Two-stream convolutional networks have shown strong performance in action recognition. However, the spatial and temporal features in a two-stream network are learned separately. Little consideration has been given to the different characteristics of the spatial and temporal streams, which are processed with the same operations. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed to impose auxiliary temporal constraints on the spatial features in the spatial injection network. A self-attention mechanism is used to attend to the regions of interest in the temporal injection stream, which reduces the influence of optical flow noise from regions of no interest. These auxiliary losses enable efficient training of two complementary streams, which capture interactions between the spatial and temporal information from different perspectives. Experiments conducted on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
ISSN:	1863-1703 1863-1711
DOI:	10.1007/s11760-022-02324-x
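
Note on the summary: it mentions auxiliary losses that supplement the training of each stream. As a rough illustration only, the general idea of an auxiliary loss can be sketched as a weighted sum of a main classification loss and an auxiliary term. The function names, the use of cross-entropy for the auxiliary term, and the `aux_weight` value are assumptions made for this sketch; this record does not describe the actual STIN losses.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, computed from raw class scores.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(main_logits, aux_logits, label, aux_weight=0.5):
    """Main classification loss plus a weighted auxiliary loss.

    aux_weight is a hypothetical hyperparameter chosen for illustration;
    the catalog record does not state how the paper weights its losses.
    """
    main = softmax_cross_entropy(main_logits, label)
    aux = softmax_cross_entropy(aux_logits, label)
    return main + aux_weight * aux


if __name__ == "__main__":
    # Toy scores for a 3-class problem; the true class is index 2.
    print(total_loss([0.1, 0.2, 2.0], [0.3, 0.1, 1.5], label=2))
```

Setting `aux_weight` to 0 recovers the main loss alone, which is one simple way to check how much the auxiliary term contributes in such a setup.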