Deepfake Video Detection with Spatiotemporal Dropout Transformer
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | While the abuse of deepfake technology has caused serious concern recently, detecting deepfake videos remains challenging due to the highly photo-realistic synthesis of each frame. Existing image-level approaches often focus on a single frame and ignore the spatiotemporal cues hidden in deepfake videos, resulting in poor generalization and robustness. The key to a video-level detector is to fully exploit the spatiotemporal inconsistency distributed across local facial regions in different frames of deepfake videos. Inspired by this, the paper proposes a simple yet effective patch-level approach that facilitates deepfake video detection via a spatiotemporal dropout transformer. The approach reorganizes each input video into a bag of patches, which is then fed into a vision transformer to achieve a robust representation. Specifically, a spatiotemporal dropout operation is proposed to fully explore patch-level spatiotemporal cues and to serve as an effective data augmentation that further enhances the model's robustness and generalization ability. The operation is flexible and can easily be plugged into existing vision transformers. Extensive experiments demonstrate the effectiveness of the approach against 25 state-of-the-art methods, with impressive robustness, generalizability, and representation ability. |
---|---|
DOI: | 10.48550/arxiv.2207.06612 |
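
The abstract describes the spatiotemporal dropout operation only at a high level. The sketch below is a minimal illustration of the general idea, not the paper's actual implementation: it randomly masks whole patch embeddings independently across frames and spatial positions before they are fed to a vision transformer. The function name, tensor layout, and drop rate are all assumptions made for illustration.

```python
import torch

def spatiotemporal_dropout(patches: torch.Tensor, drop_rate: float = 0.3) -> torch.Tensor:
    """Randomly zero out whole patches across both space and time.

    patches: (T, N, D) tensor -- T frames, N patches per frame, D-dim embeddings.
    Illustrative sketch only: the paper's exact masking scheme is not given
    in the abstract, and no rescaling (as in standard dropout) is applied here.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    T, N, _ = patches.shape
    # One Bernoulli keep/drop decision per (frame, patch) location, so cues from
    # different frames and facial regions are suppressed independently.
    keep = (torch.rand(T, N, 1, device=patches.device) >= drop_rate).to(patches.dtype)
    return patches * keep

# Toy usage: 8 frames, 196 patches per frame (14x14 grid), 768-dim embeddings.
video_patches = torch.randn(8, 196, 768)
augmented = spatiotemporal_dropout(video_patches, drop_rate=0.3)
# Flatten frames into a single "bag of patches" sequence for a vision transformer.
bag_of_patches = augmented.reshape(-1, 768)
```

Because the masking acts on patch tokens rather than pixels, such an operation could in principle be inserted in front of any standard vision transformer, which is consistent with the abstract's claim that the operation is a pluggable augmentation.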