COMPOUND SPLITTING

Bibliographic details
Main authors: MACHEREY, Klaus; OCH, Franz Josef; TALBOT, David R; POPAT, Ashok C; DAI, Andrew M
Format: Patent
Language: eng ; fre ; ger
Description
Summary: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for decompounding compound words are disclosed. In one aspect, a method includes obtaining a token that includes a sequence of characters, identifying two or more candidate sub-words that are constituents of the token, and one or more morphological operations that are required to transform the sub-words into the token, where at least one of the morphological operations involves a use of a non-dictionary word, and determining a cost associated with each sub-word and a cost associated with each morphological operation.
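
The steps named in the summary (candidate sub-words, morphological operations, and a cost for each) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the `WORD_COST` and `LINKER_COST` tables and their values are hypothetical, the only morphological operation modeled is inserting a linking morpheme (a string such as "s" that is not itself a dictionary word), and combining costs by summing them and keeping the cheapest split is one plausible reading, since the summary only says that costs are determined.

```python
from functools import lru_cache

# Hypothetical dictionary of sub-words with illustrative costs (lower = more plausible).
WORD_COST = {"verkehr": 2.0, "zeichen": 2.5, "arbeit": 2.0, "amt": 1.5}

# Hypothetical linking morphemes (non-dictionary strings) and the assumed cost
# of the "insert linker" operation. The empty string means plain concatenation.
LINKER_COST = {"": 0.0, "s": 0.5, "es": 1.0}


def decompound(token):
    """Return (total_cost, sub_words, operations) for the cheapest split, or None."""
    token = token.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Cheapest analysis of token[start:], or None if no analysis exists.
        if start == len(token):
            return (0.0, (), ())
        candidates = []
        for end in range(start + 1, len(token) + 1):
            word = token[start:end]
            if word not in WORD_COST:
                continue
            for linker, op_cost in LINKER_COST.items():
                nxt = end + len(linker)
                if token[end:nxt] != linker:
                    continue
                # A linker joins two sub-words, so it may not end the token.
                if linker and nxt == len(token):
                    continue
                rest = best(nxt)
                if rest is None:
                    continue
                cost = WORD_COST[word] + op_cost + rest[0]
                ops = ((("insert", linker),) if linker else ()) + rest[2]
                candidates.append((cost, (word,) + rest[1], ops))
        return min(candidates, default=None)

    return best(0)


print(decompound("Verkehrszeichen"))
```

With these assumed tables, "Verkehrszeichen" is analyzed as the sub-words "verkehr" and "zeichen" joined by one "insert s" operation, and a token with no dictionary-backed analysis yields `None`.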