Multi-modal humor segment prediction in video

Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is an interesting but challenging task for an intelligent system. Previous methods predict humor at the sentence level given some text (e.g., a speech transcript), sometimes together with other modalities, such as video and speech. Such methods ignore humor caused by the visual modality, since their prediction is made per sentence. In this work, we first provide new humor annotations for a sitcom by deriving ground-truth temporal segments of humor from the laughter track. We then propose a method to find these temporal segments of humor. We adopt a sliding-window approach, where the visual modality is described by pose and facial features and the linguistic modality is given as the subtitles in each window. We use long short-term memory networks to encode the temporal dependency in poses and facial features and pre-trained BERT to handle subtitles. Experimental results show that our method improves the performance of humor prediction.

Detailed description

Bibliographic details
Published in: Multimedia systems, 2023-08, Vol. 29 (4), p. 2389-2398
Main authors: Yang, Zekun; Nakashima, Yuta; Takemura, Haruo
Format: Article
Language: English
Online access: Full text
Description: Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is an interesting but challenging task for an intelligent system. Previous methods predict humor at the sentence level given some text (e.g., a speech transcript), sometimes together with other modalities, such as video and speech. Such methods ignore humor caused by the visual modality, since their prediction is made per sentence. In this work, we first provide new humor annotations for a sitcom by deriving ground-truth temporal segments of humor from the laughter track. We then propose a method to find these temporal segments of humor. We adopt a sliding-window approach, where the visual modality is described by pose and facial features and the linguistic modality is given as the subtitles in each window. We use long short-term memory networks to encode the temporal dependency in poses and facial features and pre-trained BERT to handle subtitles. Experimental results show that our method improves the performance of humor prediction.
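The windowed annotation scheme the description outlines (ground-truth humor segments derived from the laughter track, then sliding windows labeled against those segments) can be illustrated with a small, self-contained sketch. All function names, window sizes, gap and overlap thresholds below are illustrative assumptions for this example, not the authors' actual settings.

```python
def laughter_to_segments(laughter, min_gap=1.0):
    """Merge laughter (start, end) bursts, in seconds, into ground-truth
    humor segments, joining bursts separated by less than min_gap."""
    segments = []
    for start, end in sorted(laughter):
        if segments and start - segments[-1][1] < min_gap:
            # Close to the previous burst: extend that segment.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


def label_windows(duration, segments, win=2.0, stride=0.5, min_overlap=0.5):
    """Slide a window over [0, duration] and mark it positive when at least
    a min_overlap fraction of the window intersects a humor segment."""
    labels = []
    t = 0.0
    while t + win <= duration:
        overlap = sum(max(0.0, min(t + win, e) - max(t, s))
                      for s, e in segments)
        labels.append((t, t + win, overlap / win >= min_overlap))
        t += stride
    return labels


# Two close laughter bursts merge into one humor segment.
laughter = [(3.0, 4.0), (4.5, 6.0), (12.0, 13.0)]
segments = laughter_to_segments(laughter)  # → [(3.0, 6.0), (12.0, 13.0)]
windows = label_windows(15.0, segments)
```

In the full method, each positive or negative window would then be represented by its pose/facial feature sequence (fed to an LSTM) and its subtitle text (fed to pre-trained BERT) for classification.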
DOI: 10.1007/s00530-023-01105-x
Publisher: Springer Berlin Heidelberg
ISSN: 0942-4962
EISSN: 1432-1882
Source: Springer Nature - Complete Springer Journals
Subjects:
Annotations
Computer Communication Networks
Computer Graphics
Computer Science
Cryptology
Data Storage Representation
Linguistics
Multimedia Information Systems
Operating Systems
Regular Paper
Segments
Sliding
Speech
Video
Visual signals