KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | With the growth of online services, the need for advanced text classification
algorithms, such as sentiment analysis and biased text detection, has become
increasingly evident. The anonymous nature of online services often leads to
the presence of biased and harmful language, posing challenges to maintaining
the health of online communities. This phenomenon is especially relevant in
South Korea, where large-scale hate speech detection algorithms have not yet
been broadly explored. In this paper, we introduce "KoMultiText", a new
comprehensive, large-scale dataset collected from a well-known South Korean SNS
platform. Our proposed dataset provides annotations including (1) Preferences,
(2) Profanities, and (3) Nine types of Bias for the text samples, enabling
multi-task learning for simultaneous classification of user-generated texts.
Leveraging state-of-the-art BERT-based language models, our approach surpasses
human-level accuracy across diverse classification tasks, as measured by
various metrics. Beyond academic contributions, our work can provide practical
solutions for real-world hate speech and bias mitigation, contributing directly
to the improvement of online community health. Our work provides a robust
foundation for future research aiming to improve the quality of online
discourse and foster societal well-being. All source code and datasets are
publicly accessible at https://github.com/Dasol-Choi/KoMultiText. |
DOI: | 10.48550/arxiv.2310.04313 |