Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models
Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this p...
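Published in: | arXiv.org 2023-10 |
---|---|
Main authors: | Ghosh, Reshmi; Harjeet Singh Kajal; Kamath, Sharanya; Shrivastava, Dhuri; Basu, Samyadeep; Zeng, Hansi; Soundararajan Srinivasan |
Format: | Article |
Language: | eng |
Subjects: | Datasets; Empirical analysis; Entropy; Segmentation; Texts; Training; Unstructured data |
Online access: | Full text |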
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Ghosh, Reshmi; Harjeet Singh Kajal; Kamath, Sharanya; Shrivastava, Dhuri; Basu, Samyadeep; Zeng, Hansi; Soundararajan Srinivasan |
description | Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2882596834</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2882596834</sourcerecordid><originalsourceid>FETCH-proquest_journals_28825968343</originalsourceid><addsrcrecordid>eNqNjEELgjAYhkcQJOV_GHQWbFNbZys61Mk6y3CfMtHN9m39_iSCrp1e3of3eRckYpzvEpExtiIxYp-mKSv2LM95RNTdTrqhFXQjGC-9tobadu6jTirvQuODA0WlUfRh8AdKa17g8CPIgR6llwgeaUBtOnqVpguyA3qzCgbckGUrB4T4m2uyPZ_u5SWZnH0GQF_3Nrj5B2smBMsPheAZ_2_1BpDtR0M</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2882596834</pqid></control><display><type>article</type><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><source>Free E- Journals</source><creator>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</creator><creatorcontrib>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</creatorcontrib><description>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. 
We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Empirical analysis ; Entropy ; Segmentation ; Texts ; Training ; Unstructured data</subject><ispartof>arXiv.org, 2023-10</ispartof><rights>2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Ghosh, Reshmi</creatorcontrib><creatorcontrib>Harjeet Singh Kajal</creatorcontrib><creatorcontrib>Kamath, Sharanya</creatorcontrib><creatorcontrib>Shrivastava, Dhuri</creatorcontrib><creatorcontrib>Basu, Samyadeep</creatorcontrib><creatorcontrib>Zeng, Hansi</creatorcontrib><creatorcontrib>Soundararajan Srinivasan</creatorcontrib><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><title>arXiv.org</title><description>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. 
However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</description><subject>Datasets</subject><subject>Empirical analysis</subject><subject>Entropy</subject><subject>Segmentation</subject><subject>Texts</subject><subject>Training</subject><subject>Unstructured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjEELgjAYhkcQJOV_GHQWbFNbZys61Mk6y3CfMtHN9m39_iSCrp1e3of3eRckYpzvEpExtiIxYp-mKSv2LM95RNTdTrqhFXQjGC-9tobadu6jTirvQuODA0WlUfRh8AdKa17g8CPIgR6llwgeaUBtOnqVpguyA3qzCgbckGUrB4T4m2uyPZ_u5SWZnH0GQF_3Nrj5B2smBMsPheAZ_2_1BpDtR0M</recordid><startdate>20231026</startdate><enddate>20231026</enddate><creator>Ghosh, Reshmi</creator><creator>Harjeet Singh Kajal</creator><creator>Kamath, Sharanya</creator><creator>Shrivastava, Dhuri</creator><creator>Basu, Samyadeep</creator><creator>Zeng, 
Hansi</creator><creator>Soundararajan Srinivasan</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20231026</creationdate><title>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</title><author>Ghosh, Reshmi ; Harjeet Singh Kajal ; Kamath, Sharanya ; Shrivastava, Dhuri ; Basu, Samyadeep ; Zeng, Hansi ; Soundararajan Srinivasan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28825968343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Datasets</topic><topic>Empirical analysis</topic><topic>Entropy</topic><topic>Segmentation</topic><topic>Texts</topic><topic>Training</topic><topic>Unstructured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Ghosh, Reshmi</creatorcontrib><creatorcontrib>Harjeet Singh Kajal</creatorcontrib><creatorcontrib>Kamath, Sharanya</creatorcontrib><creatorcontrib>Shrivastava, Dhuri</creatorcontrib><creatorcontrib>Basu, Samyadeep</creatorcontrib><creatorcontrib>Zeng, Hansi</creatorcontrib><creatorcontrib>Soundararajan Srinivasan</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection 
(ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ghosh, Reshmi</au><au>Harjeet Singh Kajal</au><au>Kamath, Sharanya</au><au>Shrivastava, Dhuri</au><au>Basu, Samyadeep</au><au>Zeng, Hansi</au><au>Soundararajan Srinivasan</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models</atitle><jtitle>arXiv.org</jtitle><date>2023-10-26</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. 
We stress-test our proposed Topic Segmentation approach by experimenting with multiple loss functions, in order to mitigate effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that Focal Loss function is a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2882596834 |
source | Free E- Journals |
subjects | Datasets; Empirical analysis; Entropy; Segmentation; Texts; Training; Unstructured data |
title | Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T11%3A09%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Topic%20Segmentation%20of%20Semi-Structured%20and%20Unstructured%20Conversational%20Datasets%20using%20Language%20Models&rft.jtitle=arXiv.org&rft.au=Ghosh,%20Reshmi&rft.date=2023-10-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2882596834%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2882596834&rft_id=info:pmid/&rfr_iscdi=true |