Identifying multiple languages in a content item

A system for identifying language(s) for content items is disclosed. The system can identify different languages for content item words segments by identifying segment languages that maximize a probability across the segments. The probability can be a combination of: an author's likelihood for...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Park, Seyoung, Pal, Aditya, Funiak, Stanislav, Huang, Fei, Herdagdelen, Amac, Merl, Daniel Matthew
Format: Patent
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:A system for identifying language(s) for content items is disclosed. The system can identify different languages for content item words segments by identifying segment languages that maximize a probability across the segments. The probability can be a combination of: an author's likelihood for the language identified for the first word; a combination of transition frequencies for selected languages identified for words, the transition frequencies indicating likelihoods that a transition occurred to the selected language from the previous word's language; and a combination of observation probabilities indicating, for a given word in the content item, a likelihood the given word is in the identified language. For an in-vocabulary word, the observation probabilities can be based on learned probability for that word. For an out-of-vocabulary word, the probability can be computed by breaking the word into overlapping n-grams and computing combined learned probabilities that each n-gram is in the given language.