Relational Reasoning for Group Activity Recognition via Self-Attention Augmented Conditional Random Field

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2021, Vol. 30, pp. 8184-8199
Authors: Pramono, Rizard Renanda Adhi; Fang, Wen-Hsien; Chen, Yie-Tarng
Format: Article
Language: English
Description
Abstract: This paper presents a new relational network for group activity recognition. The essence of the network is to integrate conditional random fields (CRFs) with self-attention to infer the temporal dependencies and spatial relationships of the actors. This combination takes advantage of the capability of CRFs to model actors' features that depend on each other and the capability of self-attention to learn the temporal evolution and spatial relational contexts of every actor in a video. Additionally, the CRF and self-attention have two distinct facets. First, the pairwise energy of the new CRF relies on both temporal self-attention and spatial self-attention, which apply the self-attention mechanism to the features in time and space, respectively. Second, to address both local and non-local relationships in group activities, the spatial self-attention takes into account a collection of cliques with different scales of spatial locality. The associated mean-field inference can thus be reformulated as a self-attention network that generates the relational contexts of the actors and their individual action labels. Lastly, a bidirectional universal transformer encoder (UTE) is utilized to aggregate the forward and backward temporal context information, scene information, and relational contexts for group activity recognition. A new loss function is also employed, consisting not only of the classification cost for individual actions and group activities, but also of a contrastive loss that addresses the miscellaneous relational contexts between actors. Simulations show that the new approach can surpass previous works on four commonly used datasets.
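As a rough illustration of the mechanism the abstract describes, the sketch below approximates a single mean-field refinement step of the CRF with scaled dot-product self-attention over the actors of one frame, with an optional clique mask standing in for the different scales of spatial locality. All module names, dimensions, the residual update, and the toy classification loss are assumptions made for illustration only; this is not the authors' implementation, and the paper's contrastive loss and bidirectional UTE aggregation are omitted.

```python
# Hypothetical sketch: one mean-field-style refinement of actor features,
# realised as self-attention, as loosely described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalSelfAttention(nn.Module):
    """One refinement step: the CRF pairwise term is approximated by
    scaled dot-product self-attention over the N actor features of a frame."""

    def __init__(self, dim: int, num_actions: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, actors: torch.Tensor, clique_mask: torch.Tensor = None):
        # actors: (B, N, D) per-actor appearance features of one frame.
        q, k, v = self.query(actors), self.key(actors), self.value(actors)
        # Pairwise "energies" between every pair of actors.
        scores = q @ k.transpose(-2, -1) / actors.size(-1) ** 0.5
        if clique_mask is not None:
            # Boolean (B, N, N) mask restricting attention to a spatial clique
            # (e.g. nearby actors); different masks give different scales of locality.
            scores = scores.masked_fill(~clique_mask, float("-inf"))
        attn = scores.softmax(dim=-1)
        context = attn @ v            # message passing among actors
        refined = actors + context    # unary features plus relational context
        # Return relational contexts and per-actor action logits.
        return refined, self.action_head(refined)


# Toy usage: 12 actors, 256-D features, 9 action classes (arbitrary numbers).
B, N, D, C = 2, 12, 256, 9
module = RelationalSelfAttention(D, C)
contexts, logits = module(torch.randn(B, N, D))
labels = torch.randint(0, C, (B, N))
action_loss = F.cross_entropy(logits.reshape(-1, C), labels.reshape(-1))
```

In the paper's full objective this individual-action cost is combined with a group-activity classification cost and a contrastive loss over the relational contexts; the exact formulations are given in the article.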
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2021.3113570