PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm
Saved in:
Main Authors: , , , , ,
Format: Article
Language: English
Subjects:
Online Access: Order full text
Summary: Video Virtual Try-on aims to fluently transfer a garment image onto the semantically aligned try-on area of a source person video. Previous methods leveraged an inpainting mask to remove the original garment from the source video, achieving accurate garment transfer on simple model videos. However, when these methods are applied to realistic video data with more complex scene changes and posture movements, the overly large and incoherent agnostic masks destroy essential spatial-temporal information in the original video, thereby limiting the fidelity and coherence of the try-on video. To alleviate this problem, we propose a novel point-enhanced mask-free video virtual try-on framework (PEMF-VVTO). Specifically, we first leverage a pre-trained mask-based try-on model to construct large-scale paired training data (pseudo-person samples). Training on these mask-free data enables our model to perceive the original spatial-temporal information while realizing accurate garment transfer. Then, based on pre-acquired sparse frame-cloth and frame-frame point alignments, we design point-enhanced spatial attention (PSA) and point-enhanced temporal attention (PTA) to further improve the try-on accuracy and video coherence of the mask-free model. Concretely, PSA explicitly guides garment transfer to the desired locations through sparse semantic alignments between video frames and the cloth. PTA exploits temporal attention over sparse point correspondences to enhance the smoothness of the generated videos. Extensive qualitative and quantitative experiments clearly show that our PEMF-VVTO generates more natural and coherent try-on videos than existing state-of-the-art methods.
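
The pseudo-person data construction mentioned above can be pictured with a short sketch. The Python below only illustrates the common mask-free pairing recipe the abstract suggests: a mask-based model dresses the source video in a different garment, and the mask-free model is trained to map that pseudo-person video back to the original frames given the original garment. The callables `mask_based_tryon` and `agnostic_mask` and the exact data layout are assumptions for illustration, not the authors' pipeline.

```python
# A minimal sketch of mask-free pseudo-pair construction, under the assumptions above.
def build_pseudo_pair(source_frames, original_garment, other_garment,
                      mask_based_tryon, agnostic_mask):
    """Returns (model_input, target) for one training clip of the mask-free model."""
    # 1) Use agnostic masks plus a pre-trained mask-based try-on model to dress the
    #    person in a *different* garment, producing a pseudo-person video.
    masks = [agnostic_mask(frame) for frame in source_frames]
    pseudo_frames = mask_based_tryon(source_frames, masks, other_garment)

    # 2) The mask-free model receives the pseudo-person video and the original garment
    #    image, and is supervised to reconstruct the untouched source frames, so it
    #    never sees an agnostic mask at training or inference time.
    model_input = (pseudo_frames, original_garment)
    target = source_frames
    return model_input, target
```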
DOI: 10.48550/arxiv.2412.03021
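
The point-enhanced spatial attention described in the summary can likewise be sketched as a frame-to-garment cross-attention whose logits are biased at the sparse frame-cloth point correspondences. The PyTorch below is a minimal sketch assuming an additive logit bias and illustrative tensor shapes; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def point_enhanced_cross_attention(frame_q, cloth_k, cloth_v, point_pairs, bias=2.0):
    """frame_q: (B, Nf, C) frame tokens; cloth_k, cloth_v: (B, Nc, C) garment tokens;
    point_pairs: per-batch-item list of (frame_idx, cloth_idx) sparse correspondences."""
    B, Nf, C = frame_q.shape
    Nc = cloth_k.shape[1]
    # Standard scaled dot-product cross-attention logits between frame and garment tokens.
    logits = torch.einsum("bqc,bkc->bqk", frame_q, cloth_k) / C ** 0.5  # (B, Nf, Nc)

    # Encourage each matched frame token to attend to its aligned garment token,
    # steering the garment transfer toward the sparse semantic alignments.
    attn_bias = torch.zeros(B, Nf, Nc, device=frame_q.device)
    for b, pairs in enumerate(point_pairs):
        for f_idx, c_idx in pairs:
            attn_bias[b, f_idx, c_idx] = bias

    weights = F.softmax(logits + attn_bias, dim=-1)
    return torch.einsum("bqk,bkc->bqc", weights, cloth_v)
```

The temporal counterpart (PTA) would presumably apply the same biasing idea along the time axis, using frame-frame point correspondences to keep matched locations consistent across generated frames.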