A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Online Access: | Order full text |
Abstract:

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which addresses these challenges with a single unified architecture and achieves comparable or better performance than the state of the art on each task. Inspired by Vision Transformers (ViTs), UNIMASK-M decomposes a human pose into body parts to exploit the spatio-temporal relationships inherent in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as one reconstruction problem, distinguished only by the masking pattern given as input. By explicitly informing the model which joints are masked, UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset and achieves state-of-the-art results in motion in-betweening on the LaFAN1 dataset, particularly for long transition periods. More information can be found on the project website: https://evm7.github.io/UNIMASKM-page/
DOI: 10.48550/arxiv.2308.07301
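
The abstract's central idea, casting forecasting and in-betweening as one masked-reconstruction problem over body-part tokens, can be illustrated with a short sketch. This is a minimal illustration of the concept, not the authors' implementation: the body-part grouping, array shapes, and names (`patchify_pose`, `prediction_mask`, `inbetweening_mask`) are assumptions made here for clarity.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) patchify a skeleton into body-part tokens, ViT-style, and
# (2) express forecasting / in-betweening as masked reconstruction that
#     differs only in the masking pattern given as input.
import numpy as np

# Hypothetical 22-joint skeleton split into five body parts.
BODY_PARTS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15, 16],
    "right_leg": [17, 18, 19, 20, 21],
}

def patchify_pose(motion: np.ndarray) -> dict:
    """Split a (T, J, 3) motion sequence into per-body-part token
    sequences, so a transformer attends over parts rather than joints."""
    return {name: motion[:, idx, :].reshape(motion.shape[0], -1)
            for name, idx in BODY_PARTS.items()}

def prediction_mask(T: int, observed: int) -> np.ndarray:
    """Forecasting: the first `observed` frames are visible; every
    later frame is masked and must be reconstructed."""
    mask = np.ones(T, dtype=bool)  # True = masked
    mask[:observed] = False
    return mask

def inbetweening_mask(T: int, context: int) -> np.ndarray:
    """In-betweening: leading key-poses and the final target pose are
    visible; the transition in between is masked."""
    mask = np.ones(T, dtype=bool)
    mask[:context] = False
    mask[-1] = False
    return mask

T, J = 60, 22
motion = np.random.randn(T, J, 3)   # stand-in for a skeleton sequence
tokens = patchify_pose(motion)      # e.g. tokens["torso"] has shape (60, 12)

for task, mask in [("forecasting", prediction_mask(T, observed=10)),
                   ("in-betweening", inbetweening_mask(T, context=10))]:
    # A real model would receive both the masked tokens and the mask
    # itself; knowing which joints/frames are masked is what the abstract
    # credits for robustness to occlusions.
    masked = {k: np.where(mask[:, None], 0.0, v) for k, v in tokens.items()}
    print(task, "- masked frames:", int(mask.sum()), "of", T)
```

Under this framing, a single network can cover multiple tasks simply by varying the masking pattern during training, which is what makes the architecture task-independent.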