Multi-order spatial interaction network for human pose estimation


Bibliographic details
Published in: Digital Signal Processing 2023-10, Vol. 142, p. 104219, Article 104219
Main authors: Wang, Dong; Xie, Wenjun; Cai, Youcheng; Li, Xinjie; Liu, Xiaoping
Format: Article
Language: English
Description
Abstract: Vision Transformers have recently been applied to human pose estimation and have achieved excellent performance through the second-order spatial interaction provided by self-attention. However, it is still unclear whether higher-order spatial interaction can facilitate pose estimation. In this paper, we propose a novel approach based on multi-order spatial interactions and confirm that combining different orders is beneficial for the human pose estimation task. We first build a Triple Interaction Module (TIM) from pure convolutions so that spatial information interacts three times. In contrast to the Transformer, the TIM is compatible with several pure convolutions and extends the second-order interaction of the Transformer to third order without extensive additional computation, which makes it easier to explore inter-related features between keypoints in the human body. In addition, we combine the TIM with a traditional CNN and a Transformer to form the Multi-order Spatial Interaction Network (MSIN). We use MSIN to extract keypoint heatmaps and verify that the order-by-order structure can enhance the overall performance of locating human keypoints. Experimental results demonstrate that MSIN performs favorably against state-of-the-art CNN-based and Transformer-based counterparts on the COCO and MPII datasets, while being more lightweight.
• The Triple Interaction Module can be built from pure convolutions so that spatial information interacts three times.
• Combining the Triple Interaction Module with a traditional CNN and a Transformer can improve the performance of locating keypoints.
• The Triple Interaction Module can more effectively explore inter-related features between keypoints in the human body.
• The order-by-order structure can enhance the overall performance of locating human keypoints.
ISSN: 1051-2004
DOI: 10.1016/j.dsp.2023.104219