Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | The impressive achievements of generative models in creating high-quality
videos have raised concerns about digital integrity and privacy
vulnerabilities. Recent work on combating Deepfake videos has produced
detectors that are highly accurate at identifying GAN-generated samples.
However, the robustness of these detectors on diffusion-generated videos
from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and
Pika) is still unexplored. In this paper, we propose a novel framework
for detecting videos synthesized by multiple state-of-the-art (SOTA)
generative models, such as Stable Video Diffusion. We find that the SOTA
methods for detecting diffusion-generated images lack robustness in identifying
diffusion-generated videos. Our analysis reveals that the effectiveness of
these detectors diminishes when applied to out-of-domain videos, primarily
because they struggle to track the temporal features and dynamic variations
between frames. To address this challenge, we collect a new
benchmark dataset of diffusion-generated videos using SOTA video
creation tools. We extract representations of video frames from the explicit
knowledge of the diffusion model and train our detector with a CNN + LSTM
architecture. The evaluation shows that our framework captures the
temporal features between frames well, achieves 93.7% detection accuracy for
in-domain videos, and improves accuracy on out-of-domain videos by up to 16
points. |
---|---|
DOI: | 10.48550/arxiv.2406.09601 |