A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction
The goal of addressee detection is to answer the question, "Are you talking to me?" When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information.
Saved in:
Published in: | IEEE transactions on multimedia 2015-09, Vol.17 (9), p.1550-1561 |
---|---|
Main authors: | Tsai, T. J., Stolcke, Andreas, Slaney, Malcolm |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 1561 |
---|---|
container_issue | 9 |
container_start_page | 1550 |
container_title | IEEE transactions on multimedia |
container_volume | 17 |
creator | Tsai, T. J.; Stolcke, Andreas; Slaney, Malcolm |
description | The goal of addressee detection is to answer the question, "Are you talking to me?" When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information. Using data from a multiparty dialogue system, we quantify the benefits of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities, as well as of key individual features, in estimating the addressee. We find that energy-based acoustic features are by far the most important, that information from speech recognition and system state is useful as well, and that visual and beamforming features provide little additional benefit. While we find that head pose is affected by whom the speaker is addressing, it yields little nonredundant information due to the system acting as a situational attractor. Our findings would be relevant to multiparty, open-world dialogue systems in which the agent plays an active, conversational role, such as an interactive assistant deployed in a public, open space. For these scenarios, our study suggests that acoustic, lexical, and system-state information is an effective and practical combination of modalities to use for addressee detection. We also consider how our analyses might be affected by the ongoing development of more realistic, natural dialogue systems. |
doi_str_mv | 10.1109/TMM.2015.2454332 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-9210 |
ispartof | IEEE transactions on multimedia, 2015-09, Vol.17 (9), p.1550-1561 |
issn | 1520-9210; 1941-0077 |
language | eng |
recordid | cdi_ieee_primary_7153545 |
source | IEEE Electronic Library (IEL) |
subjects | Acoustics; Addressee detection; Beamforming; Computational modeling; Computers; dialogue system; Face; Feature extraction; head pose; human-human-computer; Interactive; Multimedia; multimodal; multiparty; Open spaces; prosody; Speech; Speech recognition; Talking; Visual; Visualization |
title | A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T16%3A19%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Study%20of%20Multimodal%20Addressee%20Detection%20in%20Human-Human-Computer%20Interaction&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Tsai,%20T.%20J.&rft.date=2015-09&rft.volume=17&rft.issue=9&rft.spage=1550&rft.epage=1561&rft.pages=1550-1561&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2015.2454332&rft_dat=%3Cproquest_RIE%3E3778597401%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1704208578&rft_id=info:pmid/&rft_ieee_id=7153545&rfr_iscdi=true |