Sapiens: Foundation for Human Vision Models
We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simp...
Gespeichert in:
Hauptverfasser: | , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | We present Sapiens, a family of models for four fundamental human-centric
vision tasks -- 2D pose estimation, body-part segmentation, depth estimation,
and surface normal prediction. Our models natively support 1K high-resolution
inference and are extremely easy to adapt for individual tasks by simply
fine-tuning models pretrained on over 300 million in-the-wild human images. We
observe that, given the same computational budget, self-supervised pretraining
on a curated dataset of human images significantly boosts the performance for a
diverse set of human-centric tasks. The resulting models exhibit remarkable
generalization to in-the-wild data, even when labeled data is scarce or
entirely synthetic. Our simple model design also brings scalability -- model
performance across tasks improves as we scale the number of parameters from 0.3
to 2 billion. Sapiens consistently surpasses existing baselines across various
human-centric benchmarks. We achieve significant improvements over the prior
state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1
mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5%
relative angular error. Project page:
https://about.meta.com/realitylabs/codecavatars/sapiens. |
---|---|
DOI: | 10.48550/arxiv.2408.12569 |