A Generative Model for Punctuation in Dependency Trees
Format: Article
Language: English
Abstract: Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree's "true" punctuation marks are not observed (Nunberg, 1990). These latent "underlying" marks serve to delimit or separate constituents in the syntax tree. When the tree's yield is rendered as a written sentence, a string-rewriting mechanism transduces the underlying marks into "surface" marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete-data likelihood (similarly to EM). When we use the trained model to reconstruct the tree's underlying punctuation, the results appear plausible across 5 languages and, in particular, are consistent with Nunberg's analysis of English. We show that our generative model beats baselines on punctuation restoration. Our reconstruction of a sentence's underlying punctuation also lets us render the surface punctuation appropriately (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.
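The abstract describes an underlying-to-surface transduction: marks that delimit constituents in the tree collide at constituent boundaries, and local rewriting decides which marks actually surface. The sketch below is a hypothetical Python toy, not the paper's model or its learned rules; the hand-written absorption and quote-transposition rules, token strings, and function names are all assumptions made purely for illustration.

```python
# Toy illustration (assumed rules, not the paper's): rewrite a sequence of
# underlying punctuation marks into surface marks with two local rules:
#   - absorption: of two adjacent separators, only the "stronger" one survives
#   - transposition: a comma/period moves inside an adjacent closing quote
STRENGTH = {".": 3, ";": 2, ",": 1}  # toy ranking used by the absorption rule

def rewrite_pair(left: str, right: str) -> list[str]:
    """Rewrite one adjacent pair of underlying marks into surface marks."""
    if left == "”" and right in {",", "."}:      # quote transposition
        return [right, left]
    if left in STRENGTH and right in STRENGTH:   # absorption
        return [max(left, right, key=STRENGTH.get)]
    return [left, right]                         # no rule applies

def to_surface(tokens: list[str]) -> list[str]:
    """Single left-to-right pass applying the pairwise rules."""
    out: list[str] = []
    for tok in tokens:
        out.append(tok)
        # Keep rewriting the last two positions while a rule still fires.
        while len(out) >= 2:
            new = rewrite_pair(out[-2], out[-1])
            if new == out[-2:]:
                break
            out[-2:] = new
    return out

# Underlying yield: the quoted clause is followed by the comma that delimits
# it, and the final clause's comma collides with the sentence-final period.
underlying = ["“", "stop", "”", ",", "she", "said", ",", "."]
print(" ".join(to_surface(underlying)))  # “ stop , ” she said .
```

In the paper's setting, rules of this kind are not hand-written: they are parameters of the generative model, learned without ever observing the underlying marks.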
DOI: 10.48550/arxiv.1906.11298