Extracting Parallel Paragraphs from Common Crawl

Most current methods for mining parallel texts from the web assume that the pages of a website share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based...

Detailed description

Bibliographic details
Published in:	Prague bulletin of mathematical linguistics 2017-04, Vol.107 (1), p.39-56
Main authors:	Kúdela, Jakub; Holubová, Irena; Bojar, Ondřej
Format:	Article
Language:	English
Subjects:	Bilingualism; Computerized corpora; Dictionaries; Language; Linguistics; Paragraphs; Parallel corpora; Segments; Websites
Online access:	Full text