Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video

Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video recordings. The conventional wisdom in communication theory predicts people will fall for fake news more often when the same version of a story is presented as a video versus text. We conduct 5 pre-registered randomized experiments with 2,215 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, question framings, and media modalities. We find base rates of misinformation minimally influence discernment, and deepfakes with audio produced by state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice actor audio. Moreover, across all experiments, we find audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said, the audio-visual cues, than on what is said, the speech content.

Full Description

Saved in:
Bibliographic Details
Main Authors: Groh, Matthew; Sankaranarayanan, Aruna; Singh, Nikhil; Kim, Dong Young; Lippman, Andrew; Picard, Rosalind
Format: Article
Language: English
Online Access: Order full text
DOI: 10.48550/arxiv.2202.12883
Published: 2022-02-25
License: http://creativecommons.org/licenses/by-nc-sa/4.0
Full text: https://arxiv.org/abs/2202.12883
Source: arXiv.org
Subjects: Computer Science - Artificial Intelligence; Computer Science - Human-Computer Interaction