Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources

To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the g...

Bibliographic details
Main authors:	Hwang, Yerin; Kim, Yongil; Bae, Hyunkyung; Bang, Jeesoo; Lee, Hwanhee; Jung, Kyomin
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Computation and Language

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Hwang, Yerin; Kim, Yongil; Bae, Hyunkyung; Bang, Jeesoo; Lee, Hwanhee; Jung, Kyomin
description	To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.
doi_str_mv	10.48550/arxiv.2311.07589
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2311_07589</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2311_07589</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-7bf57c4ab973969bc0ab4cf37ddefce955ea597b4aad6e9329f6e1a10ee1cd143</originalsourceid><addsrcrecordid>eNotj9FKwzAYhXPjhWw-gFfmBVKTJWmW3Y1OpzAQWe_Ln_SPBLpG0m5On15Xd3U4B74DHyH3ghdqqTV_hHyOp2IhhSi40Ut7S_abCF36iD-YV7RK_YjnkcEXZLy0E-YBxph66Nj7mm5ghAFHusUe87TTkNOB1n_QETq6T8fscZiTmwDdgHfXnJH6-amuXtjubftarXcMSmOZcUEbr8BZI21pnefglA_StC0Gj1ZrBG2NUwBtiVYubChRgOCIwrdCyRl5-L-drJrPHA-Qv5uLXTPZyV_ms0xk</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources</title><source>arXiv.org</source><creator>Hwang, Yerin ; Kim, Yongil ; Bae, Hyunkyung ; Bang, Jeesoo ; Lee, Hwanhee ; Jung, Kyomin</creator><creatorcontrib>Hwang, Yerin ; Kim, Yongil ; Bae, Hyunkyung ; Bang, Jeesoo ; Lee, Hwanhee ; Jung, Kyomin</creatorcontrib><description>To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). 
Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.</description><identifier>DOI: 10.48550/arxiv.2311.07589</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language</subject><creationdate>2023-11</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2311.07589$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2311.07589$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Hwang, Yerin</creatorcontrib><creatorcontrib>Kim, Yongil</creatorcontrib><creatorcontrib>Bae, Hyunkyung</creatorcontrib><creatorcontrib>Bang, Jeesoo</creatorcontrib><creatorcontrib>Lee, Hwanhee</creatorcontrib><creatorcontrib>Jung, Kyomin</creatorcontrib><title>Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources</title><description>To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. 
However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj9FKwzAYhXPjhWw-gFfmBVKTJWmW3Y1OpzAQWe_Ln_SPBLpG0m5On15Xd3U4B74DHyH3ghdqqTV_hHyOp2IhhSi40Ut7S_abCF36iD-YV7RK_YjnkcEXZLy0E-YBxph66Nj7mm5ghAFHusUe87TTkNOB1n_QETq6T8fscZiTmwDdgHfXnJH6-amuXtjubftarXcMSmOZcUEbr8BZI21pnefglA_StC0Gj1ZrBG2NUwBtiVYubChRgOCIwrdCyRl5-L-drJrPHA-Qv5uLXTPZyV_ms0xk</recordid><startdate>20231109</startdate><enddate>20231109</enddate><creator>Hwang, Yerin</creator><creator>Kim, Yongil</creator><creator>Bae, Hyunkyung</creator><creator>Bang, Jeesoo</creator><creator>Lee, Hwanhee</creator><creator>Jung, Kyomin</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231109</creationdate><title>Dialogizer: Context-aware Conversational-QA Dataset 
Generation from Textual Sources</title><author>Hwang, Yerin ; Kim, Yongil ; Bae, Hyunkyung ; Bang, Jeesoo ; Lee, Hwanhee ; Jung, Kyomin</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-7bf57c4ab973969bc0ab4cf37ddefce955ea597b4aad6e9329f6e1a10ee1cd143</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Hwang, Yerin</creatorcontrib><creatorcontrib>Kim, Yongil</creatorcontrib><creatorcontrib>Bae, Hyunkyung</creatorcontrib><creatorcontrib>Bang, Jeesoo</creatorcontrib><creatorcontrib>Lee, Hwanhee</creatorcontrib><creatorcontrib>Jung, Kyomin</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Hwang, Yerin</au><au>Kim, Yongil</au><au>Bae, Hyunkyung</au><au>Bang, Jeesoo</au><au>Lee, Hwanhee</au><au>Jung, Kyomin</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources</atitle><date>2023-11-09</date><risdate>2023</risdate><abstract>To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. 
To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.</abstract><doi>10.48550/arxiv.2311.07589</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2311.07589
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2311_07589
source	arXiv.org
subjects	Computer Science - Artificial Intelligence; Computer Science - Computation and Language
title	Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T12%3A53%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Dialogizer:%20Context-aware%20Conversational-QA%20Dataset%20Generation%20from%20Textual%20Sources&rft.au=Hwang,%20Yerin&rft.date=2023-11-09&rft_id=info:doi/10.48550/arxiv.2311.07589&rft_dat=%3Carxiv_GOX%3E2311_07589%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
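
The resolver link in the url field carries the citation as OpenURL key/value pairs (rft.atitle, rft.au, rft.date, rft_id). A minimal sketch of reading those keys with Python's standard library follows; the helper name openurl_fields is hypothetical, and the link is shortened here to a few of its keys for readability.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the resolver link from the url field (assumption: only a
# subset of the original query keys is reproduced here).
url = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.atitle=Dialogizer:%20Context-aware%20Conversational-QA%20Dataset"
    "%20Generation%20from%20Textual%20Sources"
    "&rft.au=Hwang,%20Yerin"
    "&rft.date=2023-11-09"
    "&rft_id=info:doi/10.48550/arxiv.2311.07589"
    "&rft_id=info:oai/"
)


def openurl_fields(link):
    """Hypothetical helper: map each query key to the list of its decoded values."""
    return parse_qs(urlsplit(link).query)


fields = openurl_fields(url)

# Keys may repeat (rft_id appears twice), so every value is a list.
title = fields["rft.atitle"][0]
first_author = fields["rft.au"][0]
date = fields["rft.date"][0]

# Pick the rft_id entry that holds the DOI and strip its "info:doi/" prefix.
doi_prefix = "info:doi/"
doi = next(v[len(doi_prefix):] for v in fields["rft_id"] if v.startswith(doi_prefix))

print(title, first_author, date, doi, sep="\n")
```

Because parse_qs returns a list per key, repeated keys such as rft_id are preserved rather than overwritten, which is why the DOI is selected by its prefix instead of by position.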