Motion Transformer for Unsupervised Image Animation
Format: Article
Language: English
Online access: Order full text
Abstract: Image animation aims to animate a source image using motion learned from a driving video. Current state-of-the-art methods typically use convolutional neural networks (CNNs) to predict motion information, such as motion keypoints and corresponding local transformations. However, these CNN-based methods do not explicitly model the interactions between motions; as a result, important underlying motion relationships may be neglected, which can lead to noticeable artifacts in the generated animation video. To this end, we propose a new method, the motion transformer, which is the first attempt to build a motion estimator based on a vision transformer. More specifically, we introduce two types of tokens in our proposed method: i) image tokens formed from patch features and the corresponding position encoding; and ii) motion tokens encoded with motion information. Both types of tokens are fed into vision transformers to promote underlying interactions between them through multi-head self-attention blocks. Through this process, the motion information can be better learned to boost model performance. The final embedded motion tokens are then used to predict the corresponding motion keypoints and local transformations. Extensive experiments on benchmark datasets show that our proposed method achieves promising results compared with state-of-the-art baselines. Our source code will be publicly available.
DOI: 10.48550/arxiv.2209.14024
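
To make the token scheme described in the abstract concrete, below is a minimal, hypothetical sketch in PyTorch: patch-based image tokens with a learned position encoding are concatenated with learned motion tokens, passed through multi-head self-attention blocks, and the embedded motion tokens are read out by keypoint and local-transformation heads. The class name, dimensions, number of keypoints, and the 2x2 Jacobian-style head are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MotionTransformerSketch(nn.Module):
    """Toy motion estimator: image tokens + motion tokens -> keypoints + local transforms.

    All hyperparameters below are illustrative assumptions, not the paper's settings.
    """

    def __init__(self, img_size=256, patch_size=16, in_channels=3,
                 embed_dim=256, depth=6, num_heads=8, num_keypoints=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # i) Image tokens: non-overlapping patch features plus a learned position encoding.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # ii) Motion tokens: one learned query token per motion keypoint.
        self.motion_tokens = nn.Parameter(torch.zeros(1, num_keypoints, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.motion_tokens, std=0.02)
        # Multi-head self-attention blocks shared by both token types.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Prediction heads: a 2-D keypoint and a 2x2 local (Jacobian-style) transformation.
        self.kp_head = nn.Linear(embed_dim, 2)
        self.jac_head = nn.Linear(embed_dim, 4)
        self.num_keypoints = num_keypoints

    def forward(self, frame):
        b = frame.size(0)
        patches = self.patch_embed(frame).flatten(2).transpose(1, 2)   # (B, N, D)
        image_tokens = patches + self.pos_embed
        motion_tokens = self.motion_tokens.expand(b, -1, -1)           # (B, K, D)
        tokens = torch.cat([motion_tokens, image_tokens], dim=1)
        embedded = self.encoder(tokens)[:, :self.num_keypoints]        # embedded motion tokens
        keypoints = torch.tanh(self.kp_head(embedded))                 # (B, K, 2), in [-1, 1]
        local_transforms = self.jac_head(embedded).view(b, self.num_keypoints, 2, 2)
        return keypoints, local_transforms


# Quick shape check on a random frame.
model = MotionTransformerSketch()
kp, jac = model(torch.randn(2, 3, 256, 256))
print(kp.shape, jac.shape)  # torch.Size([2, 10, 2]) torch.Size([2, 10, 2, 2])
```

In this sketch the motion tokens play the role of learned queries: because they attend to every image token (and to each other) in the shared self-attention blocks, the interactions between motions are modeled explicitly rather than left implicit as in a purely convolutional keypoint predictor.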