Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation
Main authors: | , , , , , , , , , |
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | The target of 2D human pose estimation is to locate the keypoints of body parts in input 2D images. State-of-the-art methods for pose estimation usually construct pixel-wise heatmaps from keypoints as labels for learning convolutional neural networks, which are typically initialized randomly or with ImageNet classification models as their backbones. We note that the 2D pose estimation task depends heavily on the contextual relationship between image patches, so we introduce a self-supervised method for pretraining 2D pose estimation networks. Specifically, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as our pretext task, whose target is to learn the location of each patch from an image composed of shuffled patches. During pretraining, we use only images of person instances in MS-COCO, rather than introducing the extra and much larger ImageNet dataset. A heatmap-style label for patch location is designed, and our learning process is non-contrastive. The weights learned through the HSJP pretext task are used as backbones of 2D human pose estimators, which are then finetuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate the mAP score on both the MS-COCO validation and test-dev datasets. Our experiments show that downstream pose estimators with our self-supervised pretraining obtain much better performance than those trained from scratch and are comparable to those using ImageNet classification models as their initial backbones. |
DOI: | 10.48550/arxiv.2012.07101 |
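The abstract only describes the HSJP pretext task in words, so a minimal sketch of how one such training sample could be assembled is given below. The grid size, heatmap resolution, Gaussian sigma, and the function name `make_hsjp_sample` are illustrative assumptions and do not come from the paper; the actual HSJP label design may differ.

```python
import numpy as np


def make_hsjp_sample(image, grid=3, heatmap_size=(64, 64), sigma=2.0):
    """Assemble one Heatmap-Style Jigsaw Puzzles (HSJP) training pair (sketch).

    The person image is cut into a grid x grid set of patches, the patches are
    shuffled, and each shuffled patch gets a Gaussian heatmap centred on its
    ORIGINAL grid cell. A backbone would then take the shuffled image and
    regress these per-patch location heatmaps (a non-contrastive objective).
    All hyper-parameters here are assumptions, not values from the paper.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    image = image[:grid * ph, :grid * pw]  # crop so the patches tile exactly

    # Cut the image into grid*grid patches in row-major order.
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]

    # perm[slot] is the original index of the patch placed at `slot`
    # in the shuffled image.
    perm = np.random.permutation(grid * grid)

    shuffled = np.zeros_like(image)
    for slot, orig_idx in enumerate(perm):
        r, c = divmod(slot, grid)
        shuffled[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = patches[orig_idx]

    # One heatmap channel per shuffled slot, with a Gaussian peak at the
    # centre of that patch's original grid cell (in heatmap coordinates).
    hh, hw = heatmap_size
    ys, xs = np.mgrid[0:hh, 0:hw]
    heatmaps = np.zeros((grid * grid, hh, hw), dtype=np.float32)
    for slot, orig_idx in enumerate(perm):
        r, c = divmod(orig_idx, grid)
        cy, cx = (r + 0.5) * hh / grid, (c + 0.5) * hw / grid
        heatmaps[slot] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2)
                                / (2.0 * sigma ** 2))

    return shuffled, heatmaps


# Example usage with a random stand-in for a cropped person image:
# train the backbone to predict `targets` from `shuffled_img` with a
# pixel-wise heatmap regression loss (e.g. MSE), as in keypoint heatmaps.
img = np.random.randint(0, 256, (192, 144, 3), dtype=np.uint8)
shuffled_img, targets = make_hsjp_sample(img)   # targets.shape == (9, 64, 64)
```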