Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
Main authors: , , , , , ,
Format: Article
Language: English
Online access: Order full text
Abstract: In recent years, DeepFake technology has achieved unprecedented success in high-quality video synthesis, but these methods also pose severe potential security threats. DeepFakes can be divided into entertainment applications like face swapping and illicit uses such as lip-syncing fraud. However, lip-forgery videos, which neither change identity nor leave discernible visual artifacts, present a formidable challenge to existing DeepFake detection methods. Our preliminary experiments show that existing methods often degrade drastically or even fail outright when tackling lip-syncing videos. In this paper, for the first time, we propose a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals. We also mimic natural human cognition by capturing subtle biological links between the lip and head regions to boost accuracy. To better illustrate the effectiveness and advances of our proposed method, we create a high-quality LipSync dataset, AVLips, using state-of-the-art lip generators. We hope this high-quality and diverse dataset will serve further research in this challenging and interesting field. Experimental results show that our approach achieves an average accuracy of more than 95.3% in spotting lip-syncing videos, significantly outperforming the baselines. Extensive experiments demonstrate our method's capability to tackle DeepFakes and its robustness to diverse input transformations. Our method achieves an accuracy of up to 90.2% in real-world scenarios (e.g., WeChat video calls), demonstrating its suitability for real-world deployment. To facilitate the progress of this research community, we release all resources at https://github.com/AaronComo/LipFD.
DOI: 10.48550/arxiv.2401.15668
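
The abstract's core idea is scoring temporal consistency between the audio stream and the lip-motion stream of a clip. Below is a minimal sketch of that idea, not the authors' LipFD implementation: the feature arrays, the `consistency_score` helper, and the windowed cosine-similarity measure are all illustrative assumptions standing in for the paper's learned features and classifier.

```python
# Hypothetical sketch: score audio-visual temporal consistency.
# Real systems would extract audio features (e.g., spectrograms) and
# lip-region motion features per frame; here both streams are stand-in
# NumPy arrays already aligned to a common frame rate.
import numpy as np

def consistency_score(audio_feats: np.ndarray,
                      lip_feats: np.ndarray,
                      window: int = 5) -> float:
    """Mean cosine similarity between audio and lip-motion features
    over sliding temporal windows. Both inputs have shape (T, D).
    Low scores suggest the lips do not track the audio (possible
    lip-syncing forgery)."""
    assert audio_feats.shape == lip_feats.shape
    T = audio_feats.shape[0]
    sims = []
    for t in range(T - window + 1):
        a = audio_feats[t:t + window].ravel()
        v = lip_feats[t:t + window].ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(v) + 1e-8
        sims.append(float(a @ v) / denom)
    return float(np.mean(sims))

# Toy usage: a "genuine" clip whose lip features track the audio,
# versus a "forged" clip with statistically independent lip motion.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 16))
real_lips = audio + 0.1 * rng.standard_normal((100, 16))  # in sync
fake_lips = rng.standard_normal((100, 16))                # out of sync
print(consistency_score(audio, real_lips))  # close to 1.0
print(consistency_score(audio, fake_lips))  # near 0.0
```

In the paper itself, the decision is made by a trained model rather than a fixed similarity threshold, and the abstract's "biological links between the lip and head regions" suggest additional cross-region cues beyond the lip-audio pairing sketched here.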