Learning Natural Consistency Representation for Face Forgery Video Detection
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Face forgery videos have raised serious public concern, and various
detectors have been proposed. However, fully supervised detectors tend to
overfit to specific forgery methods or videos, and existing
self-supervised detectors impose strict requirements on auxiliary tasks, such as requiring
audio or multiple modalities, which limits generalization and robustness. In
this paper, we examine whether we can address this issue by leveraging
visual-only real face videos. To this end, we propose to learn the Natural
Consistency representation (NACO) of real face videos in a self-supervised
manner, which is inspired by the observation that fake videos struggle to
maintain the natural spatiotemporal consistency even under unknown forgery
methods and different perturbations. NACO first extracts spatial features
of each frame with CNNs, then feeds them into a Transformer to learn the
long-range spatiotemporal representation, combining the strengths of CNNs in
local spatial receptive fields and of Transformers in long-term memory.
Furthermore, a Spatial Predictive Module (SPM) and a Temporal Contrastive
Module (TCM) are introduced to enhance the natural consistency representation
learning. The SPM predicts randomly masked spatial features from the
spatiotemporal representation, and the TCM regularizes the latent distance of
the spatiotemporal representation by shuffling the natural order to disturb the
consistency; both make NACO more sensitive to natural
spatiotemporal consistency. After the representation learning stage, an MLP head
is fine-tuned to perform the usual forgery video classification task. Extensive
experiments show that our method outperforms other state-of-the-art competitors
with impressive generalization and robustness. |
---|---|
DOI: | 10.48550/arxiv.2407.10550 |
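
The summary describes two self-supervised signals: predicting randomly masked spatial features (SPM) and separating natural-order from shuffled-order representations (TCM). The sketch below illustrates those two ideas on plain Python lists. It is not the authors' implementation: the summary gives no loss formulas, so the mean-squared-error reconstruction loss, the triplet-style margin loss, the zero-vector masking, and every function and parameter name here (`mask_frames`, `spatial_prediction_loss`, `shuffle_order`, `temporal_margin_loss`, `mask_ratio`, `margin`) are assumptions chosen for illustration. The CNN and Transformer encoders are omitted entirely.

```python
import math
import random


def mask_frames(features, mask_ratio, rng):
    """Zero out a random subset of per-frame feature vectors (assumed masking scheme).

    Returns the masked copy and the sorted list of masked frame indices.
    """
    n = len(features)
    k = min(n, max(1, round(n * mask_ratio)))
    masked_idx = sorted(rng.sample(range(n), k))
    dim = len(features[0])
    masked = [[0.0] * dim if i in masked_idx else list(f)
              for i, f in enumerate(features)]
    return masked, masked_idx


def spatial_prediction_loss(predicted, target, masked_idx):
    """Mean squared error over the masked positions only (assumed loss form)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def shuffle_order(features, rng):
    """Return the frames in a permuted order that differs from the natural one."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two frames to shuffle")
    natural = list(range(n))
    order = natural[:]
    while order == natural:
        rng.shuffle(order)
    return [features[i] for i in order], order


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def temporal_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style hinge loss (assumed): keep natural-order embeddings close
    to the anchor and push the shuffled-order embedding at least `margin` farther."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy usage: 8 frames, each with a 2-dimensional stand-in "spatial feature".
rng = random.Random(0)
frames = [[float(t), 0.5 * float(t)] for t in range(8)]
masked_frames, masked_idx = mask_frames(frames, mask_ratio=0.25, rng=rng)
shuffled_frames, order = shuffle_order(frames, rng)
```

In a real pipeline the `predicted` argument would come from a model that sees `masked_frames`, and the anchor, positive, and negative vectors would be encoder outputs for natural-order and shuffled-order clips; those parts are outside what the summary specifies.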