Spatio-temporal graph-based CNNs for anomaly detection in weakly-labeled videos

Bibliographic Details
Published in: Information Processing & Management, 2022-07, Vol. 59 (4), p. 102983, Article 102983
Authors: Mu, Huiyu; Sun, Ruizhi; Wang, Miao; Chen, Zeqiu
Format: Article
Language: English
Description
Abstract:

Highlights:
•A spatial similarity graph and a temporal consistency graph are constructed, and an attention mechanism is introduced to allocate attention to each segment.
•A novel spatial-temporal fusion graph module is proposed to capture the corresponding identifying information synchronously; long-range spatial-temporal dependencies can also be extracted as layers are stacked.
•We formulate a ranking loss that encourages the STGCNs to attend to the context around the anomalous part, and a classification loss to adapt to the variation of raw videos and the relative scarcity of abnormal events.
•We evaluate the proposed method on several anomaly detection benchmarks, where it achieves excellent performance compared with state-of-the-art anomaly detection methods.

Abnormal event detection in videos plays an essential role in public security. However, most weakly supervised learning methods ignore the relationship between the complicated spatial correlations and the dynamic trends of temporal patterns in video data. In this paper, we provide a new perspective: spatial similarity and temporal consistency are adopted to construct Spatio-Temporal Graph-based CNNs (STGCNs). For feature extraction, we use Inflated 3D (I3D) convolutional networks, which better capture appearance and motion dynamics in videos. For the spatial graph and the temporal graph, each video segment is regarded as a vertex, and an attention mechanism is introduced to allocate attention to each segment. For the spatial-temporal fusion graph, we propose a self-adapting weighting to fuse the two graphs. Finally, we build a ranking loss and a classification loss to improve the robustness of STGCNs. We evaluate STGCNs on the UCF-Crime dataset (128 h in total) and the ShanghaiTech dataset (317,398 frames in total), achieving AUC scores of 84.2% and 92.3%, respectively. The experimental results also show effectiveness and robustness under other evaluation metrics.
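The abstract outlines the pipeline at a high level: per-segment I3D features, a spatial similarity graph, a temporal consistency graph, self-adapting fusion of the two branches, and per-segment anomaly scoring. The sketch below is a minimal illustration of that idea in PyTorch; the module name STGraphSketch, the hidden dimension, the cosine-similarity adjacency, and the sigmoid-gated fusion weight are assumptions made for illustration and do not reproduce the authors' STGCNs implementation (the ranking and classification losses are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STGraphSketch(nn.Module):
    """Illustrative sketch (not the authors' code): per-segment I3D features are
    propagated over a feature-similarity graph and a temporal-adjacency graph,
    then fused with a learned weighting before per-segment anomaly scoring."""

    def __init__(self, feat_dim=1024, hidden_dim=128):
        super().__init__()
        self.proj_spatial = nn.Linear(feat_dim, hidden_dim)   # graph-conv weight, similarity branch
        self.proj_temporal = nn.Linear(feat_dim, hidden_dim)  # graph-conv weight, temporal branch
        self.fuse_gate = nn.Parameter(torch.zeros(1))         # assumed self-adapting fusion weight
        self.scorer = nn.Linear(hidden_dim, 1)                # per-segment anomaly score

    def forward(self, feats):
        # feats: (T, feat_dim) I3D features, one row per video segment (graph vertex).
        T = feats.size(0)

        # Spatial similarity graph: softmax-normalised cosine similarity between segments.
        normed = F.normalize(feats, dim=1)
        adj_sim = F.softmax(normed @ normed.t(), dim=1)

        # Temporal consistency graph: each segment is linked to its temporal neighbours.
        idx = torch.arange(T)
        adj_tmp = ((idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= 1).float()
        adj_tmp = adj_tmp / adj_tmp.sum(dim=1, keepdim=True)  # row-normalise

        # One graph-convolution step per branch.
        h_sim = F.relu(self.proj_spatial(adj_sim @ feats))
        h_tmp = F.relu(self.proj_temporal(adj_tmp @ feats))

        # Self-adapting fusion of the two branches.
        g = torch.sigmoid(self.fuse_gate)
        h = g * h_sim + (1.0 - g) * h_tmp

        return torch.sigmoid(self.scorer(h)).squeeze(-1)      # (T,) anomaly scores in [0, 1]


# Usage: 32 segments of 1024-d I3D features from one video.
scores = STGraphSketch()(torch.randn(32, 1024))
```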
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2022.102983