SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Task-oriented dialogue (TOD) models have made significant progress in recent years. However, previous studies primarily focus on datasets written by annotators, which has resulted in a gap between academic research and real-world spoken conversation scenarios. While several small-scale spoken TOD datasets have been proposed to address robustness issues such as ASR errors, they ignore the unique challenges of spoken conversation. To tackle these limitations, we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. SpokenWOZ further incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language. Based on these characteristics, we present cross-turn slot and reasoning slot detection as new challenges. We conduct experiments on various baselines, including text-modal models, newly proposed dual-modal models, and LLMs, e.g., ChatGPT. The results show that current models still have substantial room for improvement in spoken conversation: the most advanced dialogue state tracker achieves only 25.65% joint goal accuracy, and the SOTA end-to-end model correctly completes the user request in only 52.1% of dialogues. The dataset, code, and leaderboard are available: https://spokenwoz.github.io/.
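For context on the dialogue state tracking result cited above: joint goal accuracy (JGA) is the standard all-or-nothing metric — a turn counts as correct only when the predicted slot-value state exactly matches the gold state. A minimal sketch, with illustrative slot names that are not taken from SpokenWOZ:

```python
def joint_goal_accuracy(predictions, golds):
    """JGA: fraction of turns whose predicted state equals the gold state.

    predictions / golds: parallel lists of dicts mapping slot -> value,
    one dict per dialogue turn.
    """
    if not golds:
        return 0.0
    correct = sum(1 for pred, gold in zip(predictions, golds) if pred == gold)
    return correct / len(golds)


golds = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
preds = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north"},  # one missing slot makes the whole turn wrong
]
print(joint_goal_accuracy(preds, golds))  # 0.5
```

Because every slot must be right simultaneously, JGA is far harsher than per-slot accuracy, which is why a 25.65% score still represents a hard open problem.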

Bibliographic Details
Published in: arXiv.org, 2024-03
Main authors: Si, Shuzheng; Ma, Wentao; Gao, Haoyu; Wu, Yuchuan; Lin, Ting-En; Dai, Yinpei; Li, Hangyu; Yan, Rui; Huang, Fei; Li, Yongbin
Format: Article
Language: English
Online access: Full text
EISSN: 2331-8422
Source: Free E-Journals
Subjects: Datasets; Domains; Reasoning; Speech; Word processing; Words (language)