Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | Existing DeepFake detection techniques primarily focus on facial
manipulations, such as face-swapping or lip-syncing. However, advancements in
text-to-video (T2V) and image-to-video (I2V) generative models now allow fully
AI-generated synthetic content and seamless background alterations, challenging
face-centric detection methods and demanding more versatile approaches.
To address this, we introduce the \underline{U}niversal \underline{N}etwork
for \underline{I}dentifying \underline{T}ampered and synth\underline{E}tic
videos (\texttt{UNITE}) model, which, unlike traditional detectors, captures
full-frame manipulations. \texttt{UNITE} extends detection capabilities to
scenarios without faces, non-human subjects, and complex background
modifications. It leverages a transformer-based architecture that processes
domain-agnostic features extracted from videos via the SigLIP-So400M foundation
model. Given limited datasets encompassing both facial/background alterations
and T2V/I2V content, we integrate task-irrelevant data alongside standard
DeepFake datasets in training. We further mitigate the model's tendency to
over-focus on faces by incorporating an attention-diversity (AD) loss, which
promotes diverse spatial attention across video frames. Combining AD loss with
cross-entropy improves detection performance across varied contexts.
Comparative evaluations demonstrate that \texttt{UNITE} outperforms
state-of-the-art detectors on datasets (in cross-data settings) featuring
face/background manipulations and fully synthetic T2V/I2V videos, showcasing
its adaptability and generalizable detection capabilities. |
DOI: | 10.48550/arxiv.2412.12278 |
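The abstract describes \texttt{UNITE} as a transformer operating on domain-agnostic per-frame features from the SigLIP-So400M foundation model, trained with cross-entropy plus an attention-diversity (AD) loss. The sketch below is not the authors' implementation: the feature dimension (1152), the CLS-token classifier, and the particular AD formulation (penalising pairwise similarity between per-frame spatial attention maps) are illustrative assumptions consistent with, but not specified by, the abstract.

```python
# Minimal sketch of a UNITE-style pipeline as described in the abstract, NOT the
# authors' released code. Per-frame features from a frozen foundation model
# (SigLIP-So400M is assumed to give 1152-dim frame embeddings) feed a transformer
# encoder; training combines cross-entropy with an attention-diversity (AD) term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoForgeryClassifier(nn.Module):
    def __init__(self, feat_dim=1152, d_model=512, n_heads=8, n_layers=4, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)            # map frame features to model width
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim), one embedding per sampled video frame
        x = self.proj(frame_feats)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                            # classify from the CLS token


def attention_diversity_loss(attn_maps):
    # attn_maps: (B, T, P) spatial attention per frame, normalised over P patches.
    # One plausible reading of the AD loss: discourage all frames from attending to
    # the same region (e.g. a face) by penalising pairwise similarity of their maps.
    a = F.normalize(attn_maps, dim=-1)
    sim = torch.einsum("btp,bsp->bts", a, a)                 # frame-to-frame similarity
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return off_diag.mean()


def training_loss(logits, labels, attn_maps, lambda_ad=0.1):
    # Combined objective named in the abstract: cross-entropy plus the AD term.
    return F.cross_entropy(logits, labels) + lambda_ad * attention_diversity_loss(attn_maps)
```

In this reading, the AD term pushes different frames' attention away from a single dominant region, which matches the abstract's stated goal of promoting diverse spatial attention across video frames; the weighting factor `lambda_ad` is a hypothetical hyperparameter.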