MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., aver...
Published in: | IEEE transactions on information forensics and security 2024, Vol.19, p.6084-6096 |
---|---|
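Main authors: | Coccomini, Davide Alessandro ; Zilos, Giorgos Kordopatis ; Amato, Giuseppe ; Caldelli, Roberto ; Falchi, Fabrizio ; Papadopoulos, Symeon ; Gennaro, Claudio |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |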
container_end_page | 6096 |
---|---|
container_issue | |
container_start_page | 6084 |
container_title | IEEE transactions on information forensics and security |
container_volume | 19 |
creator | Coccomini, Davide Alessandro ; Zilos, Giorgos Kordopatis ; Amato, Giuseppe ; Caldelli, Roberto ; Falchi, Fabrizio ; Papadopoulos, Symeon ; Gennaro, Claudio |
description | In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection . |
doi_str_mv | 10.1109/TIFS.2024.3409054 |
format | Article |
publisher | New York: IEEE |
rights | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024 |
coden | ITIFA6 |
tpages | 13 |
orcidid | 0000-0001-6258-5313 ; 0000-0003-3471-1196 ; 0000-0002-0755-6154 ; 0000-0003-0171-4315 ; 0000-0002-5441-7341 |
toplevel | peer_reviewed ; online_resources |
oa | free_for_read |
fulltext | fulltext |
identifier | ISSN: 1556-6013 |
ispartof | IEEE transactions on information forensics and security, 2024, Vol.19, p.6084-6096 |
issn | 1556-6013 1556-6021 |
language | eng |
recordid | cdi_ieee_primary_10547206 |
source | IEEE Electronic Library (IEL) |
subjects | Artificial neural networks ; computer vision ; Convolutional neural networks ; Datasets ; Deception ; deep learning ; Deepfake ; Deepfake detection ; Deepfakes ; Embedding ; Face recognition ; Faces ; Invariants ; Task analysis ; Transformers ; Vectors ; Video ; vision transformers |
title | MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T08%3A07%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MINTIME:%20Multi-Identity%20Size-Invariant%20Video%20Deepfake%20Detection&rft.jtitle=IEEE%20transactions%20on%20information%20forensics%20and%20security&rft.au=Coccomini,%20Davide%20Alessandro&rft.date=2024&rft.volume=19&rft.spage=6084&rft.epage=6096&rft.pages=6084-6096&rft.issn=1556-6013&rft.eissn=1556-6021&rft.coden=ITIFA6&rft_id=info:doi/10.1109/TIFS.2024.3409054&rft_dat=%3Cproquest_ieee_%3E3070781204%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3070781204&rft_id=info:pmid/&rft_ieee_id=10547206&rfr_iscdi=true |
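
The description above states that the Identity-aware Attention mechanism applies a masking operation on the face sequence so that each identity is processed independently. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes plain single-head self-attention in NumPy, and the names `identity_mask`, `masked_attention`, and `identity_ids` are hypothetical. It leaves out the Spatio-Temporal Transformer, the CNN backbone, and both embedding schemes.

```python
import numpy as np


def identity_mask(identity_ids):
    """Return a boolean (N, N) mask that is True where two tokens share an identity."""
    ids = np.asarray(identity_ids)
    return ids[:, None] == ids[None, :]


def masked_attention(x, identity_ids):
    """Single-head self-attention restricted to tokens of the same identity.

    x            -- (N, D) array of face-token features (hypothetical input)
    identity_ids -- length-N sequence labelling each token with an identity
    Returns the attended features (N, D) and the attention weights (N, N).
    """
    x = np.asarray(x, dtype=float)
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = identity_mask(identity_ids)
    # Cross-identity pairs get -inf so their softmax weight becomes exactly 0.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is always
    # unmasked, so every row has at least one finite entry.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 3))       # four face tokens, three features each
    out, w = masked_attention(tokens, [0, 0, 1, 1])
    print(np.round(w, 3))                  # block-diagonal attention weights
```

With two identities of two tokens each, the weight matrix is block-diagonal: tokens of identity 0 never attend to tokens of identity 1, and vice versa. How the published model aggregates the per-identity results into a video-level prediction is not covered by this sketch.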