NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer
This paper describes our approach to the task of identifying offensive languages in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds and cross-lingual transfer with data selection. Leveraging the semi-supervise...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This paper describes our approach to the task of identifying offensive
languages in a multilingual setting. We investigate two data augmentation
strategies: using additional semi-supervised labels with different thresholds
and cross-lingual transfer with data selection. Leveraging the semi-supervised
dataset resulted in performance improvements compared to the baseline trained
solely with the manually-annotated dataset. We propose a new metric,
Translation Embedding Distance, to measure the transferability of instances for
cross-lingual data selection. We also introduce various preprocessing steps
tailored for social media text along with methods to fine-tune the pre-trained
multilingual BERT (mBERT) for offensive language identification. Our
multilingual systems achieved competitive results in Greek, Danish, and Turkish
at OffensEval 2020. |
---|---|
DOI: | 10.48550/arxiv.2008.01354 |