A realistic and robust model for Chinese word segmentation

A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive trai...

Bibliographic details
Main authors:	Huang, Chu-Ren; Yo, Ting-Shuo; Simon, Petr; Hsieh, Shu-Kai
Format:	Article
Language:	English
Subjects:	Computer Science - Computation and Language; Computer Science - Information Retrieval; Computer Science - Learning

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Huang, Chu-Ren ; Yo, Ting-Shuo ; Simon, Petr ; Hsieh, Shu-Kai
description	A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high f-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and training with 1 million vectors achieve comparable and reliable results. In addition, when applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. In conclusion, we discuss linguistic ramifications as well as future implications for the WBD approach.
doi_str_mv	10.48550/arxiv.1905.08732
format	Article
creationdate	2019-05-21
rights	http://creativecommons.org/licenses/by-sa/4.0
oa	free_for_read
linktorsrc	https://arxiv.org/abs/1905.08732
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.1905.08732
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_1905_08732
source	arXiv.org
subjects	Computer Science - Computation and Language ; Computer Science - Information Retrieval ; Computer Science - Learning
title	A realistic and robust model for Chinese word segmentation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T14%3A55%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20realistic%20and%20robust%20model%20for%20Chinese%20word%20segmentation&rft.au=Huang,%20Chu-Ren&rft.date=2019-05-21&rft_id=info:doi/10.48550/arxiv.1905.08732&rft_dat=%3Carxiv_GOX%3E1905_08732%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
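
The description above frames segmentation as a binary decision at each break between two adjacent characters. The following is a minimal sketch of that idea, not the authors' implementation: the feature encoding (a window of raw neighbouring characters), the majority-vote lookup used in place of a trained classifier, and the names `boundary_examples`, `train` and `segment` are all assumptions made for illustration. The record does not say which four classifiers or which vector encoding the paper uses.

```python
from collections import Counter, defaultdict


def boundary_examples(words, window=2, pad="#"):
    """Turn one pre-segmented sentence (a list of words) into
    (context, label) pairs, one per gap between adjacent characters.
    The raw-character context is an illustrative assumption."""
    text = "".join(words)
    # Character offsets where a word ends inside the sentence (true boundaries).
    cuts, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    padded = pad * window + text + pad * window
    examples = []
    for gap in range(1, len(text)):
        i = gap + window  # position of this gap in the padded string
        context = tuple(padded[i - window:i + window])
        examples.append((context, 1 if gap in cuts else 0))
    return examples


def train(sentences, window=2):
    """Stand-in 'classifier': remember the majority label per context."""
    votes = defaultdict(Counter)
    for words in sentences:
        for context, label in boundary_examples(words, window):
            votes[context][label] += 1
    return {ctx: counts.most_common(1)[0][0] for ctx, counts in votes.items()}


def segment(text, model, window=2, pad="#", default=1):
    """Split text by making one boundary decision per character gap.
    Unseen contexts fall back to `default` (1 = insert a boundary)."""
    if not text:
        return []
    padded = pad * window + text + pad * window
    words, start = [], 0
    for gap in range(1, len(text)):
        i = gap + window
        context = tuple(padded[i - window:i + window])
        if model.get(context, default) == 1:
            words.append(text[start:gap])
            start = gap
    words.append(text[start:])
    return words


model = train([["我们", "喜欢", "学习"]])
print(segment("我们喜欢学习", model))
```

Because every gap is classified independently, preparing training data only requires a pre-segmented text and no lexicon, which is the property the description emphasizes. A real system would replace the lookup table with a learned classifier that generalizes to unseen contexts.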