FTCF: Full temporal cross fusion network for violence detection in videos


Full description

Bibliographic details
Published in: Applied Intelligence (Dordrecht, Netherlands), 2023-02, Vol. 53 (4), p. 4218-4230
Main authors: Tan, Zhenhua; Xia, Zhenche; Wang, Pengfei; Ding, Chang; Zhai, Weichao
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: Automatic violence detection in video is a meaningful yet challenging task. Violent actions are characterized both by intense sequential frames and by continuous spatial motion, making them more complex than other human actions. However, most existing approaches focus on general spatiotemporal features extracted with local convolutions and ignore full temporal inference based on the characteristics of violence. To this end, we propose a novel full temporal cross fusion network (FTCF Net) to investigate an effective inference approach for violence detection. Specifically, we design two essential neural-network components in each FTCF block: a spatial processor and a temporal processor. The former captures the local structural features of each frame with a 3D CNN using a (3×3×1) filter to infer continuous spatial motion, while the latter performs cross-frame feature interaction step by step for each channel through a group of processing units to infer the intense and wide variation of violence over the full temporal extent. The two branches are fused efficiently at the end of each FTCF block. We conduct extensive experiments on four benchmark datasets: Hockey Fight, Movie Fight, Violent Flow, and Real-life Violence Situations. The experimental results show that FTCF Net outperforms 20 comparison methods in predictive accuracy, reaching 99.5%, 100.0%, 98.0%, and 98.5% on the four datasets respectively, validating the effectiveness of the proposed approach for violence detection. Moreover, the proposed approach maintains relatively stable prediction performance, superior to existing methods, across different training-set scales. We hope this work serves as a baseline for violence detection; the complete original code and pre-trained weights are publicly available at https://github.com/TAN-OpenLab/FTCF-NET .
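The abstract describes each FTCF block as two parallel branches, a per-frame spatial processor (a (3×3×1) filter, i.e. a 3×3 convolution with no temporal extent) and a temporal processor that propagates features across frames step by step per channel, fused at the block's end. The following is a minimal NumPy sketch of that structure only; the function name, the fixed averaging filter, and the simple recurrent mixing unit are illustrative stand-ins, not the paper's learned operators:

```python
import numpy as np

def ftcf_block_sketch(x):
    """Illustrative sketch of one FTCF-style block (not the paper's implementation).

    x: video features of shape (T, H, W, C).
    Spatial branch: a 3x3 filter applied independently to each frame,
    mimicking the (3x3x1) convolution with no temporal extent.
    Temporal branch: step-by-step cross-frame mixing per channel, so
    information propagates over the full temporal extent.
    The two branches are fused by summation at the end of the block.
    """
    T, H, W, C = x.shape

    # Spatial branch: per-frame 3x3 averaging (stand-in for a learned 3x3x1 conv).
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1), (0, 0)), mode="edge")
    spatial = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            spatial += padded[:, di:di + H, dj:dj + W, :]
    spatial /= 9.0

    # Temporal branch: sequential cross-frame interaction, channel-wise.
    # Each frame mixes with the running state of all earlier frames.
    temporal = np.zeros_like(x)
    state = np.zeros((H, W, C), dtype=x.dtype)
    for t in range(T):
        state = 0.5 * state + 0.5 * x[t]  # illustrative mixing unit
        temporal[t] = state

    # Fuse the two branches at the end of the block.
    return spatial + temporal
```

The key property the sketch preserves is that the spatial branch never looks across frames, while the temporal branch's output at frame t depends on every frame up to t, matching the "full temporal" inference the abstract emphasizes.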
ISSN: 0924-669X
EISSN: 1573-7497
DOI:10.1007/s10489-022-03708-9