Automatically detecting human-object interaction by an instance part-level attention deep framework


Bibliographic details
Published in: Pattern Recognition, 2023-02, Vol. 134, p. 109110, Article 109110
Main authors: Bai, Lin; Chen, Fenglian; Tian, Yang
Format: Article
Language: eng
Description
Summary:
• One significant problem in HOI detection is that similar HOIs are difficult to distinguish. We find that fine-grained part-level image context plays a crucial role in addressing this problem.
• We propose a part-level visual pattern estimation method to define and estimate human body parts and object parts.
• We propose a self-attention-based deep network to learn the fine-grained image context that encodes the consistent relationships between human body parts and object parts, which is effective for HOI detection.

Automatically detecting human-object interactions (HOIs) in an image is an important but challenging task in computer vision. One significant problem in HOI detection is that similar human-object interactions are difficult to distinguish. Recently, many instance-centric HOI detection schemes based on appearance features and coarse spatial information have been proposed. These methods, however, lack the capacity to capture and analyze the fine-grained context between human poses and object parts, which plays a crucial role in HOI detection. To address these problems, we propose a novel instance part-level attention deep framework for HOI detection. Specifically, our approach consists of a human/object-part detection phase and an HOI detection phase. In the former phase, a part-level visual pattern estimation model is designed to capture fine-grained human body parts and object parts. In the latter phase, a self-attention-based deep network is proposed to learn the visual composite around the human-object pair. This composite implicitly expresses the consistent spatial, scale, co-occurrence, and viewpoint relationships among human body parts and object parts across images, which are effective for predicting HOIs. To the best of our knowledge, we are the first to propose a framework in which the fine-grained part-level mutual context of a human-object pair is extracted to improve HOI detection. By comparing our approach with state-of-the-art HOI detection methods on benchmark datasets, we demonstrate that the proposed framework outperforms existing methods, with significant improvements in part-level visual pattern estimation, HOI detection, and the quality of the self-attention-based deep network structure.
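
The summary does not give the network's details, so the following is only a minimal sketch of the general idea of self-attention over part-level features, not the authors' architecture. It applies plain scaled dot-product self-attention (with identity projections, an assumption made for brevity) to a handful of feature vectors. The part names (`hand`, `head`, `cup_handle`, `cup_rim`), the feature values, and the `self_attention` helper are all hypothetical and chosen purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(parts):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    parts: list of equal-length feature vectors, one per human or object part.
    Returns (context, weights): the attended features and the attention matrix.
    """
    dim = len(parts[0])
    scale = math.sqrt(dim)
    weights = []
    for query in parts:
        # Similarity of this part to every part (including itself).
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in parts]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all part features.
    context = [
        [sum(w * value[j] for w, value in zip(row, parts)) for j in range(dim)]
        for row in weights
    ]
    return context, weights


# Hypothetical toy features; a real system would take these from a part detector.
human_parts = {"hand": [1.0, 0.0, 0.5], "head": [0.0, 1.0, 0.0]}
object_parts = {"cup_handle": [0.9, 0.1, 0.6], "cup_rim": [0.1, 0.8, 0.1]}

names = list(human_parts) + list(object_parts)
features = list(human_parts.values()) + list(object_parts.values())
context, weights = self_attention(features)

for name, row in zip(names, weights):
    print(name, [round(w, 3) for w in row])
```

In this toy setup each row of `weights` shows how strongly one part attends to every other part, so a human part whose features resemble an object part (here `hand` and `cup_handle`) receives a larger mutual weight. A trained network would additionally learn the query, key, and value projections rather than using the raw features directly.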
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2022.109110