Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models

Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this p...

Detailed description

Bibliographic details
Published in:	arXiv.org 2023-10
Main authors:	Ghosh, Reshmi; Harjeet Singh Kajal; Kamath, Sharanya; Shrivastava, Dhuri; Basu, Samyadeep; Zeng, Hansi; Soundararajan Srinivasan
Format:	Article
Language:	eng
Subjects:	Datasets; Empirical analysis; Entropy; Segmentation; Texts; Training; Unstructured data
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Ghosh, Reshmi; Harjeet Singh Kajal; Kamath, Sharanya; Shrivastava, Dhuri; Basu, Samyadeep; Zeng, Hansi; Soundararajan Srinivasan
description	Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2882596834</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2882596834</sourcerecordid><originalsourceid>FETCH-proquest_journals_28825968343</originalsourceid><addsrcrecordid>eNqNjEELgjAYhkcQJOV_GHQWbFNbZys61Mk6y3CfMtHN9m39_iSCrp1e3of3eRckYpzvEpExtiIxYp-mKSv2LM95RNTdTrqhFXQjGC-9tobadu6jTirvQuODA0WlUfRh8AdKa17g8CPIgR6llwgeaUBtOnqVpguyA3qzCgbckGUrB4T4m2uyPZ_u5SWZnH0GQF_3Nrj5B2smBMsPheAZ_2_1BpDtR0M</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2882596834</pqid></control><display><type>article</type><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><source>Free E- Journals</source><creator>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</creator><creatorcontrib>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</creatorcontrib><description>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. 
We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Empirical analysis ; Entropy ; Segmentation ; Texts ; Training ; Unstructured data</subject><ispartof>arXiv.org, 2023-10</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Ghosh, Reshmi</creatorcontrib><creatorcontrib>Harjeet Singh Kajal</creatorcontrib><creatorcontrib>Kamath, Sharanya</creatorcontrib><creatorcontrib>Shrivastava, Dhuri</creatorcontrib><creatorcontrib>Basu, Samyadeep</creatorcontrib><creatorcontrib>Zeng, Hansi</creatorcontrib><creatorcontrib>Soundararajan Srinivasan</creatorcontrib><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><title>arXiv.org</title><description>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. 
However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</description><subject>Datasets</subject><subject>Empirical analysis</subject><subject>Entropy</subject><subject>Segmentation</subject><subject>Texts</subject><subject>Training</subject><subject>Unstructured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjEELgjAYhkcQJOV_GHQWbFNbZys61Mk6y3CfMtHN9m39_iSCrp1e3of3eRckYpzvEpExtiIxYp-mKSv2LM95RNTdTrqhFXQjGC-9tobadu6jTirvQuODA0WlUfRh8AdKa17g8CPIgR6llwgeaUBtOnqVpguyA3qzCgbckGUrB4T4m2uyPZ_u5SWZnH0GQF_3Nrj5B2smBMsPheAZ_2_1BpDtR0M</recordid><startdate>20231026</startdate><enddate>20231026</enddate><creator>Ghosh, Reshmi</creator><creator>Harjeet Singh Kajal</creator><creator>Kamath, Sharanya</creator><creator>Shrivastava, Dhuri</creator><creator>Basu, Samyadeep</creator><creator>Zeng, 
Hansi</creator><creator>Soundararajan Srinivasan</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20231026</creationdate><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><author>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28825968343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Datasets</topic><topic>Empirical analysis</topic><topic>Entropy</topic><topic>Segmentation</topic><topic>Texts</topic><topic>Training</topic><topic>Unstructured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Ghosh, Reshmi</creatorcontrib><creatorcontrib>Harjeet Singh Kajal</creatorcontrib><creatorcontrib>Kamath, Sharanya</creatorcontrib><creatorcontrib>Shrivastava, Dhuri</creatorcontrib><creatorcontrib>Basu, Samyadeep</creatorcontrib><creatorcontrib>Zeng, Hansi</creatorcontrib><creatorcontrib>Soundararajan Srinivasan</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection 
(ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ghosh, Reshmi</au><au>Harjeet Singh Kajal</au><au>Kamath, Sharanya</au><au>Shrivastava, Dhuri</au><au>Basu, Samyadeep</au><au>Zeng, Hansi</au><au>Soundararajan Srinivasan</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</atitle><jtitle>arXiv.org</jtitle><date>2023-10-26</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. 
We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2023-10
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2882596834
source	Free E- Journals
subjects	Datasets; Empirical analysis; Entropy; Segmentation; Texts; Training; Unstructured data
title	Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T11%3A09%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Topic%20Segmentation%20of%20Semi-Structured%20and%20Unstructured%20Conversational%20Datasets%20using%20Language%20Models&rft.jtitle=arXiv.org&rft.au=Ghosh,%20Reshmi&rft.date=2023-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2882596834%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2882596834&rft_id=info:pmid/&rfr_iscdi=true