Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models

Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this p...

Detailed description

Bibliographic details
Published in: arXiv.org 2023-10
Main authors: Ghosh, Reshmi; Harjeet Singh Kajal; Kamath, Sharanya; Shrivastava, Dhuri; Basu, Samyadeep; Zeng, Hansi; Soundararajan Srinivasan
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Ghosh, Reshmi
Harjeet Singh Kajal
Kamath, Sharanya
Shrivastava, Dhuri
Basu, Samyadeep
Zeng, Hansi
Soundararajan Srinivasan
description Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
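The description contrasts three loss functions for imbalanced boundary labels: Cross-Entropy, re-weighted Cross-Entropy, and Focal Loss. The following is a minimal sketch of how these three losses differ for a single binary "is this utterance a segment boundary?" prediction. It is an illustration only, not the authors' implementation: the function names, the `pos_weight` value, and the `gamma`/`alpha` values are assumptions chosen for the example (`gamma=2.0`, `alpha=0.25` are commonly used Focal Loss defaults, not settings reported in this record).

```python
import math

EPS = 1e-12  # numerical guard so log() never receives 0


def _p_true(p: float, y: int) -> float:
    """Probability the model assigned to the true class, clamped away from 0 and 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return p if y == 1 else 1.0 - p


def cross_entropy(p: float, y: int) -> float:
    """Plain binary cross-entropy for one prediction p = P(boundary)."""
    return -math.log(_p_true(p, y))


def weighted_cross_entropy(p: float, y: int, pos_weight: float = 5.0) -> float:
    """Re-weighted cross-entropy: the rare boundary class is scaled up.

    pos_weight is a hypothetical value chosen for illustration.
    """
    weight = pos_weight if y == 1 else 1.0
    return weight * cross_entropy(p, y)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of well-classified examples,
    so the many easy non-boundary utterances contribute little.
    """
    p_t = _p_true(p, y)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy non-boundary utterance versus a missed boundary.
    for label, p, y in [("easy negative", 0.1, 0), ("missed boundary", 0.1, 1)]:
        print(label,
              round(cross_entropy(p, y), 4),
              round(weighted_cross_entropy(p, y), 4),
              round(focal_loss(p, y), 4))
```

With these assumed settings, the focal term reduces the loss of the easy negative far more than that of the missed boundary, which is the mechanism by which Focal Loss counteracts label imbalance without a hand-tuned class weight alone.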
format Article
date 2023-10-26
oa free_for_read
publisher Ithaca: Cornell University Library, arXiv.org
rights 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_2882596834
source Free E- Journals
subjects Datasets
Empirical analysis
Entropy
Segmentation
Texts
Training
Unstructured data
title Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T11%3A09%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Topic%20Segmentation%20of%20Semi-Structured%20and%20Unstructured%20Conversational%20Datasets%20using%20Language%20Models&rft.jtitle=arXiv.org&rft.au=Ghosh,%20Reshmi&rft.date=2023-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2882596834%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2882596834&rft_id=info:pmid/&rfr_iscdi=true