SAN: Structure-aware attention network for dyadic human relation recognition in images



Bibliographic Details
Published in: Multimedia Tools and Applications, 2024-05, Vol. 83 (16), pp. 46947-46966
Main authors: Kogashi, Kaen; Nobuhara, Shohei; Nishino, Ko
Format: Article
Language: English
Description
Abstract: We introduce a new dataset and method for Dyadic Human relation Recognition (DHR). DHR is a new task that concerns the recognition of the type (i.e., verb) and roles of a two-person interaction. Unlike past human action detection, our goal is to extract richer information regarding the roles of actors, i.e., which subjective person is acting on which objective person. For this, we introduce the DHR-WebImages dataset, which consists of a total of 22,046 images spanning 51 verb classes of DHR with per-image annotation of the verb and roles, as well as a test set for evaluating generalization capabilities, which we refer to as DHR-Generalization. We tackle DHR by introducing a novel network inspired by the hierarchical nature of cognitive human perception. At the core of the network lies a “structure-aware attention” module that weights and integrates various hierarchical visual cues associated with the DHR instance in the image. The feature hierarchy consists of three levels, namely the union, human, and joint levels, each of which extracts visual features relevant to the participants while modeling their cross-talk. We refer to this network as Structure-aware Attention Network (SAN). Experimental results show that SAN achieves accurate DHR that is robust to limited visibility of actors, and outperforms past methods by 3.04 mAP on the DHR-WebImages verb task.
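The abstract describes a structure-aware attention module that weights and fuses visual cues from three hierarchy levels (union, human, and joint). The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of the general idea: score each level's feature with a learned vector, normalize the scores with a softmax, and return the attention-weighted fusion. All names (`structure_aware_attention`, the scoring vector `w`, the feature dimension) are illustrative assumptions, not from the paper's code.

```python
import numpy as np

def structure_aware_attention(union_feat, human_feat, joint_feat, w):
    """Fuse three hierarchy-level features via softmax attention.

    Illustrative sketch only: scores each level with a learned
    vector `w`, softmax-normalizes across the three levels, and
    returns the weighted sum plus the attention weights.
    """
    feats = np.stack([union_feat, human_feat, joint_feat])  # (3, D)
    scores = feats @ w                                      # (3,)
    # Softmax over the three hierarchy levels (stabilized by max-shift)
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                                     # attention weights, sum to 1
    fused = alpha @ feats                                   # (D,) weighted fusion
    return fused, alpha

# Toy usage with feature dimension D = 4
rng = np.random.default_rng(0)
u, h, j = rng.normal(size=(3, 4))
w = np.ones(4)
fused, alpha = structure_aware_attention(u, h, j, w)
```

In the actual SAN, the scoring and fusion would be learned end-to-end and operate on convolutional feature maps rather than flat vectors; this sketch only shows the weighting-and-integration step the abstract names.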
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-023-17229-1