HolisticDFD: Infusing spatiotemporal transformer embeddings for deepfake detection

Bibliographic details
Published in:	Information sciences 2023-10, Vol.645, p.119352, Article 119352
Main authors:	Anas Raza, Muhammad; Mahmood Malik, Khalid; Ul Haq, Ijaz
Format:	Article
Language:	English
Subjects:	Deepfake Detection; Intermediate Fusion; Multimedia Forensics; Transformers
Online access:	Full text

Description
Summary:
• A novel multi-dimensional, model-infused deepfake detection method is proposed.
• Each model is pre-trained independently to learn a single inconsistency dimension.
• The method takes a holistic view by fusing embeddings intelligently.
• A compact transformer is used, with fewer parameters than state-of-the-art methods.

Deepfakes, or synthetic audiovisual media developed with the intent to deceive, are growing increasingly prevalent. Existing methods, which operate on inputs independently as images or patches or jointly as tubelets, have typically focused on spatial or spatiotemporal inconsistencies. However, the evolving nature of deepfakes demands a holistic approach. Inspecting a multimedia sample to validate its authenticity without adding significant computational overhead has not yet been fully explored in the literature, nor has the impact of different inconsistency dimensions been studied within a single framework. This paper tackles the deepfake detection problem holistically. HolisticDFD, a novel transformer-based deepfake detection method that is both lightweight and compact, intelligently combines embeddings from the spatial, temporal and spatiotemporal dimensions to separate deepfakes from bona fide videos. The proposed framework achieves an AUC of 0.926 on the DFDC dataset using just 3% of the parameters used by state-of-the-art detectors. Evaluation on other datasets shows the efficacy of the proposed framework, and an ablation study shows that the performance of the system gradually improves as embeddings with different data representations are combined. An implementation of the proposed framework is available at: https://github.com/smileslab/deepfake-detection/.
ISSN:	0020-0255
DOI:	10.1016/j.ins.2023.119352
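
Illustrative note: the summary above describes combining spatial, temporal and spatiotemporal embeddings and reports performance as AUC. The following is a minimal sketch of those two ideas, not the authors' implementation (which is at the repository linked in the summary). The record does not state how the embeddings are fused or classified, so concatenation and a logistic scoring head are assumptions here, and the names `fuse_embeddings`, `fake_probability` and `roc_auc` are hypothetical.

```python
import math


def fuse_embeddings(spatial, temporal, spatiotemporal):
    """Intermediate fusion by concatenation.

    Assumption: the record does not name the fusion operator, so plain
    concatenation of the three embedding vectors is used for illustration.
    """
    return list(spatial) + list(temporal) + list(spatiotemporal)


def fake_probability(fused, weights, bias=0.0):
    """Hypothetical logistic head mapping a fused embedding to P(fake)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen fake (label 1) receives a
    higher score than a randomly chosen real video (label 0); ties count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy embeddings standing in for the outputs of three pre-trained models.
    fused = fuse_embeddings([0.2, -0.1], [0.7], [0.4, 0.0])
    print(len(fused), fake_probability(fused, [1.0] * len(fused)))
    print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

The pairwise AUC runs in quadratic time, which is adequate for a small illustration; a rank-based computation would be the usual choice for a dataset the size of DFDC.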