Detecting human–object interaction with multi-level pairwise feature network


Bibliographic Details
Published in: Computational Visual Media, 2021-06, Vol. 7(2), pp. 229–239
Authors: Liu, Hanchao; Mu, Tai-Jiang; Huang, Xiaolei
Format: Article
Language: English
Online access: Full text
Abstract
Human–object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit the visual features and the spatial configuration of a human–object pair in order to learn the action linking the human and the object in the pair. We argue that this paradigm of pairwise feature extraction and action inference can be applied not only at the instance level, using the whole human and object, but also at the part level, where a body part interacts with an object, and at the semantic level, where the semantic label of the object is considered alongside human appearance and human–object spatial configuration. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams that characterize HOI using pairwise features at the above three levels; the three streams are fused to give the final action prediction. Extensive experiments show that the proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves results comparable to the state of the art on the HICO-DET dataset.
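The abstract gives only the high-level architecture (three parallel pairwise-feature streams, fused for action prediction), not implementation details. The sketch below illustrates the late-fusion idea for a single human–object pair; all feature dimensions, feature contents, stream parameterizations, and the fusion rule (summing per-stream action logits before a sigmoid) are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ACTIONS = 26  # V-COCO defines 26 action classes

def make_params(in_dim, hidden, out_dim):
    # Random two-layer-MLP parameters for one stream (illustrative only).
    return (rng.normal(scale=0.1, size=(in_dim, hidden)), np.zeros(hidden),
            rng.normal(scale=0.1, size=(hidden, out_dim)), np.zeros(out_dim))

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron with ReLU, mapping a pairwise feature to action logits.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Hypothetical pairwise features for one human-object pair, one per level.
feats = {
    "instance": rng.normal(size=512),  # whole-instance appearance + spatial map
    "part":     rng.normal(size=512),  # body part vs. object pairwise features
    "semantic": rng.normal(size=512),  # object label embedding + human appearance
}

# One independent stream (here: an MLP) per level, as in the three-stream design.
streams = {name: make_params(512, 256, NUM_ACTIONS) for name in feats}

def pfnet_scores(feats):
    # Late fusion: sum per-stream action logits, then apply a sigmoid so each
    # action gets an independent probability (actions are not mutually exclusive).
    logits = sum(mlp(f, *streams[name]) for name, f in feats.items())
    return 1.0 / (1.0 + np.exp(-logits))

scores = pfnet_scores(feats)
print(scores.shape)  # (26,)
```

The per-action sigmoid (rather than a softmax) reflects that a person can perform several actions on the same object simultaneously, which is standard in HOI detection benchmarks such as V-COCO.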
ISSN: 2096-0433, 2096-0662
DOI: 10.1007/s41095-020-0188-2