MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., aver...

Bibliographic details
Published in: IEEE transactions on information forensics and security 2024, Vol.19, p.6084-6096
Main authors: Coccomini, Davide Alessandro, Zilos, Giorgos Kordopatis, Amato, Giuseppe, Caldelli, Roberto, Falchi, Fabrizio, Papadopoulos, Symeon, Gennaro, Claudio
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 6096
container_issue
container_start_page 6084
container_title IEEE transactions on information forensics and security
container_volume 19
creator Coccomini, Davide Alessandro
Zilos, Giorgos Kordopatis
Amato, Giuseppe
Caldelli, Roberto
Falchi, Fabrizio
Papadopoulos, Symeon
Gennaro, Claudio
description In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection .
doi_str_mv 10.1109/TIFS.2024.3409054
format Article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_10547206</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10547206</ieee_id><sourcerecordid>3070781204</sourcerecordid><originalsourceid>FETCH-LOGICAL-c219t-30d4bdd831ad57e67534d9f15ef7652b31e1dd8e72322f38c2425cb3ea431eee3</originalsourceid><addsrcrecordid>eNpNkEtLAzEUhYMoWKs_QHAx4Hpqbh6TGTcifehAq4tWtyGd3IHUOlMzqVB_vSkt4uoeOOfcAx8h10AHALS4W5ST-YBRJgZc0IJKcUJ6IGWWZpTB6Z8Gfk4uum5FqRCQ5T3yMCtfFuVsfJ_Mtuvg0tJiE1zYJXP3g2nZfBvvTBOSd2exTUaIm9p8YBQBq-Da5pKc1Wbd4dXx9snbZLwYPqfT16dy-DhNKwZFSDm1YmltzsFYqTBTkgtb1CCxVplkSw4I0UbFOGM1zysmmKyWHI2IFiLvk9vD341vv7bYBb1qt76Jk5pTRVUOjIqYgkOq8m3Xeaz1xrtP43caqN5z0ntOes9JHznFzs2h4-LOv7wUitGM_wJFamJP</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3070781204</pqid></control><display><type>article</type><title>MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection</title><source>IEEE Electronic Library (IEL)</source><creator>Coccomini, Davide Alessandro ; Zilos, Giorgos Kordopatis ; Amato, Giuseppe ; Caldelli, Roberto ; Falchi, Fabrizio ; Papadopoulos, Symeon ; Gennaro, Claudio</creator><creatorcontrib>Coccomini, Davide Alessandro ; Zilos, Giorgos Kordopatis ; Amato, Giuseppe ; Caldelli, Roberto ; Falchi, Fabrizio ; Papadopoulos, Symeon ; Gennaro, Claudio</creatorcontrib><description>In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. 
Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. 
The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection .</description><identifier>ISSN: 1556-6013</identifier><identifier>EISSN: 1556-6021</identifier><identifier>DOI: 10.1109/TIFS.2024.3409054</identifier><identifier>CODEN: ITIFA6</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Artificial neural networks ; computer vision ; Convolutional neural networks ; Datasets ; Deception ; deep learning ; Deepfake ; Deepfake detection ; Deepfakes ; Embedding ; Face recognition ; Faces ; Invariants ; Task analysis ; Transformers ; Vectors ; Video ; vision transformers</subject><ispartof>IEEE transactions on information forensics and security, 2024, Vol.19, p.6084-6096</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c219t-30d4bdd831ad57e67534d9f15ef7652b31e1dd8e72322f38c2425cb3ea431eee3</cites><orcidid>0000-0001-6258-5313 ; 0000-0003-3471-1196 ; 0000-0002-0755-6154 ; 0000-0003-0171-4315 ; 0000-0002-5441-7341</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10547206$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,796,4022,27921,27922,27923,54756</link.rule.ids></links><search><creatorcontrib>Coccomini, Davide Alessandro</creatorcontrib><creatorcontrib>Zilos, Giorgos Kordopatis</creatorcontrib><creatorcontrib>Amato, Giuseppe</creatorcontrib><creatorcontrib>Caldelli, Roberto</creatorcontrib><creatorcontrib>Falchi, Fabrizio</creatorcontrib><creatorcontrib>Papadopoulos, Symeon</creatorcontrib><creatorcontrib>Gennaro, Claudio</creatorcontrib><title>MINTIME: Multi-Identity 
Size-Invariant Video Deepfake Detection</title><title>IEEE transactions on information forensics and security</title><addtitle>TIFS</addtitle><description>In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. 
The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection .</description><subject>Artificial neural networks</subject><subject>computer vision</subject><subject>Convolutional neural networks</subject><subject>Datasets</subject><subject>Deception</subject><subject>deep learning</subject><subject>Deepfake</subject><subject>Deepfake detection</subject><subject>Deepfakes</subject><subject>Embedding</subject><subject>Face recognition</subject><subject>Faces</subject><subject>Invariants</subject><subject>Task analysis</subject><subject>Transformers</subject><subject>Vectors</subject><subject>Video</subject><subject>vision transformers</subject><issn>1556-6013</issn><issn>1556-6021</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><recordid>eNpNkEtLAzEUhYMoWKs_QHAx4Hpqbh6TGTcifehAq4tWtyGd3IHUOlMzqVB_vSkt4uoeOOfcAx8h10AHALS4W5ST-YBRJgZc0IJKcUJ6IGWWZpTB6Z8Gfk4uum5FqRCQ5T3yMCtfFuVsfJ_Mtuvg0tJiE1zYJXP3g2nZfBvvTBOSd2exTUaIm9p8YBQBq-Da5pKc1Wbd4dXx9snbZLwYPqfT16dy-DhNKwZFSDm1YmltzsFYqTBTkgtb1CCxVplkSw4I0UbFOGM1zysmmKyWHI2IFiLvk9vD341vv7bYBb1qt76Jk5pTRVUOjIqYgkOq8m3Xeaz1xrtP43caqN5z0ntOes9JHznFzs2h4-LOv7wUitGM_wJFamJP</recordid><startdate>2024</startdate><enddate>2024</enddate><creator>Coccomini, Davide Alessandro</creator><creator>Zilos, Giorgos Kordopatis</creator><creator>Amato, Giuseppe</creator><creator>Caldelli, Roberto</creator><creator>Falchi, Fabrizio</creator><creator>Papadopoulos, Symeon</creator><creator>Gennaro, Claudio</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7TB</scope><scope>8FD</scope><scope>FR3</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-6258-5313</orcidid><orcidid>https://orcid.org/0000-0003-3471-1196</orcidid><orcidid>https://orcid.org/0000-0002-0755-6154</orcidid><orcidid>https://orcid.org/0000-0003-0171-4315</orcidid><orcidid>https://orcid.org/0000-0002-5441-7341</orcidid></search><sort><creationdate>2024</creationdate><title>MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection</title><author>Coccomini, Davide Alessandro ; Zilos, Giorgos Kordopatis ; Amato, Giuseppe ; Caldelli, Roberto ; Falchi, Fabrizio ; Papadopoulos, Symeon ; Gennaro, Claudio</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c219t-30d4bdd831ad57e67534d9f15ef7652b31e1dd8e72322f38c2425cb3ea431eee3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Artificial neural networks</topic><topic>computer vision</topic><topic>Convolutional neural networks</topic><topic>Datasets</topic><topic>Deception</topic><topic>deep learning</topic><topic>Deepfake</topic><topic>Deepfake detection</topic><topic>Deepfakes</topic><topic>Embedding</topic><topic>Face recognition</topic><topic>Faces</topic><topic>Invariants</topic><topic>Task analysis</topic><topic>Transformers</topic><topic>Vectors</topic><topic>Video</topic><topic>vision transformers</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Coccomini, Davide Alessandro</creatorcontrib><creatorcontrib>Zilos, Giorgos Kordopatis</creatorcontrib><creatorcontrib>Amato, Giuseppe</creatorcontrib><creatorcontrib>Caldelli, Roberto</creatorcontrib><creatorcontrib>Falchi, 
Fabrizio</creatorcontrib><creatorcontrib>Papadopoulos, Symeon</creatorcontrib><creatorcontrib>Gennaro, Claudio</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on information forensics and security</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Coccomini, Davide Alessandro</au><au>Zilos, Giorgos Kordopatis</au><au>Amato, Giuseppe</au><au>Caldelli, Roberto</au><au>Falchi, Fabrizio</au><au>Papadopoulos, Symeon</au><au>Gennaro, Claudio</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection</atitle><jtitle>IEEE transactions on information forensics and security</jtitle><stitle>TIFS</stitle><date>2024</date><risdate>2024</risdate><volume>19</volume><spage>6084</spage><epage>6096</epage><pages>6084-6096</pages><issn>1556-6013</issn><eissn>1556-6021</eissn><coden>ITIFA6</coden><abstract>In this paper, we present MINTIME, a video deepfake detection method that effectively 
captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection .</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TIFS.2024.3409054</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0001-6258-5313</orcidid><orcidid>https://orcid.org/0000-0003-3471-1196</orcidid><orcidid>https://orcid.org/0000-0002-0755-6154</orcidid><orcidid>https://orcid.org/0000-0003-0171-4315</orcidid><orcidid>https://orcid.org/0000-0002-5441-7341</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1556-6013
ispartof IEEE transactions on information forensics and security, 2024, Vol.19, p.6084-6096
issn 1556-6013
1556-6021
language eng
recordid cdi_ieee_primary_10547206
source IEEE Electronic Library (IEL)
subjects Artificial neural networks
computer vision
Convolutional neural networks
Datasets
Deception
deep learning
Deepfake
Deepfake detection
Deepfakes
Embedding
Face recognition
Faces
Invariants
Task analysis
Transformers
Vectors
Video
vision transformers
title MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T08%3A07%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MINTIME:%20Multi-Identity%20Size-Invariant%20Video%20Deepfake%20Detection&rft.jtitle=IEEE%20transactions%20on%20information%20forensics%20and%20security&rft.au=Coccomini,%20Davide%20Alessandro&rft.date=2024&rft.volume=19&rft.spage=6084&rft.epage=6096&rft.pages=6084-6096&rft.issn=1556-6013&rft.eissn=1556-6021&rft.coden=ITIFA6&rft_id=info:doi/10.1109/TIFS.2024.3409054&rft_dat=%3Cproquest_ieee_%3E3070781204%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3070781204&rft_id=info:pmid/&rft_ieee_id=10547206&rfr_iscdi=true