Toward Grouping in Large Scenes With Occlusion-Aware Spatio-Temporal Transformers
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-05, Vol. 34 (5), pp. 3919-3929
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Group detection, especially for large-scale scenes, has many potential applications for public safety and smart cities. Existing methods fail to cope with the frequent occlusions in large-scale scenes with multiple people and struggle to effectively exploit spatio-temporal information. In this paper, we propose an end-to-end framework, GroupTransformer, for group detection in large-scale scenes. To deal with the frequent occlusions caused by multiple people, we design an occlusion encoder to detect and suppress severely occluded person crops. To explore the potential spatio-temporal relationships, we propose spatio-temporal transformers that simultaneously extract trajectory information and fuse inter-person features in a hierarchical manner. Experimental results on both large-scale and small-scale scenes demonstrate that our method achieves better performance than state-of-the-art methods. On large-scale scenes, our method significantly boosts precision and F1 score by more than 10%. On small-scale scenes, it still improves the F1 score by more than 5%. We will release the code for research purposes.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3324868
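
The abstract describes the pipeline at a high level: an occlusion encoder that suppresses severely occluded person crops, followed by hierarchical spatio-temporal transformers that fuse trajectory and inter-person features, ending in a grouping decision. The sketch below shows how such a pipeline could be wired up with standard PyTorch modules. All module names, tensor shapes, and hyper-parameters here are assumptions for illustration only; they do not reproduce the authors' GroupTransformer implementation.

```python
# Minimal sketch (assumed design, not the authors' code): occlusion-aware
# suppression of person-crop features, then temporal and spatial attention,
# then a pairwise grouping head.
import torch
import torch.nn as nn


class OcclusionEncoder(nn.Module):
    """Scores each person crop and softly suppresses heavily occluded ones."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, T, D) appearance features of N person crops over T frames
        visibility = torch.sigmoid(self.score(feats))   # (B, N, T, 1), near 0 when occluded
        return feats * visibility                        # soft suppression of occluded crops


class SpatioTemporalTransformer(nn.Module):
    """Hierarchical fusion: temporal attention per person, then spatial attention per frame."""

    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()

        def make_encoder() -> nn.TransformerEncoder:
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=dim * 2, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.temporal = make_encoder()   # attends over the T trajectory steps of one person
        self.spatial = make_encoder()    # attends over the N people within one time step

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, n, t, d = feats.shape
        x = self.temporal(feats.reshape(b * n, t, d)).reshape(b, n, t, d)
        x = x.permute(0, 2, 1, 3).reshape(b * t, n, d)   # regroup people per frame
        x = self.spatial(x).reshape(b, t, n, d).permute(0, 2, 1, 3)
        return x.mean(dim=2)                              # (B, N, D) per-person embedding


class GroupingHead(nn.Module):
    """Pairwise affinities between people; thresholding/clustering them yields groups."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, person_emb: torch.Tensor) -> torch.Tensor:
        z = self.proj(person_emb)                         # (B, N, D)
        return torch.sigmoid(z @ z.transpose(1, 2))       # (B, N, N) affinity matrix


if __name__ == "__main__":
    feats = torch.randn(2, 6, 8, 128)                     # 2 clips, 6 people, 8 frames, 128-d crops
    occ = OcclusionEncoder(128)
    st = SpatioTemporalTransformer(dim=128, heads=4, layers=2)
    head = GroupingHead(128)
    affinity = head(st(occ(feats)))
    print(affinity.shape)                                 # torch.Size([2, 6, 6])
```

Suppressing occluded crops with a learned soft visibility score, rather than hard-dropping them, keeps the whole pipeline differentiable, which is one plausible way to realize the end-to-end framing stated in the abstract.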