A Generative Model for Punctuation in Dependency Trees

Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree's "true" punctuation marks are not observed (Nunberg, 1990). These latent "underlying" marks serve to delimit or separate constituents in the syntax tree. When the tree's yield is rendered as a written sentence, a string rewriting mechanism transduces the underlying marks into "surface" marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete data likelihood (similarly to EM). When we use the trained model to reconstruct the tree's underlying punctuation, the results appear plausible across 5 languages, and in particular, are consistent with Nunberg's analysis of English. We show that our generative model can be used to beat baselines on punctuation restoration. Also, our reconstruction of a sentence's underlying punctuation lets us appropriately render the surface punctuation (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.
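The abstract's underlying-to-surface rewriting idea can be illustrated with a minimal sketch: when two underlying marks become adjacent at a constituent boundary, the weaker one is absorbed by the stronger (a Nunberg-style absorption rule). This is a hypothetical toy, not the paper's trained transducer; the `STRENGTH` ranking and the function names `absorb` and `surface` are invented for illustration.

```python
# Toy sketch of underlying-to-surface punctuation rewriting.
# Assumption: a fixed strength ranking decides which of two
# colliding marks survives (real models learn this mapping).

STRENGTH = {"": 0, ",": 1, ";": 2, ":": 2, ".": 3, "?": 3, "!": 3}

def absorb(left: str, right: str) -> str:
    """When two underlying marks collide, keep only the stronger one
    (Nunberg-style absorption); ties keep the left mark."""
    return left if STRENGTH[left] >= STRENGTH[right] else right

def surface(underlying: list) -> str:
    """Render a sequence of words and underlying marks as a surface
    string, collapsing any run of adjacent marks into one."""
    out = []
    pending = ""  # mark waiting to attach to the previous word
    for tok in underlying:
        if tok in STRENGTH:              # a punctuation mark
            pending = absorb(pending, tok) if pending else tok
        else:                            # an ordinary word
            if out and pending:
                out[-1] += pending
            pending = ""
            out.append(tok)
    if out and pending:
        out[-1] += pending
    return " ".join(out)
```

For example, an underlying comma followed by an underlying period surfaces as just the period, so `["Hello", ",", "he", "said", ",", "."]` renders as `Hello, he said.`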

Detailed description

Bibliographic details
Main authors: Li, Xiang Lisa; Wang, Dingquan; Eisner, Jason
Format: Article
Language: eng
DOI: 10.48550/arxiv.1906.11298
Published: 2019-06-26 (arXiv)
Full text: https://arxiv.org/abs/1906.11298
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0 (free to read)
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning