Analysis and Mitigation of Religion Bias in Indonesian Natural Language Processing Datasets
Previous studies have shown the existence of misrepresentation regarding various religious identities in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been recorded to inflict harm against such marginalized identities in case...
Gespeichert in:
Veröffentlicht in: | Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) (Online) 2023-08, Vol.7 (4), p.845-857 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Previous studies have shown the existence of misrepresentation regarding various religious identities in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been recorded to inflict harm against such marginalized identities in cases such as automated content moderation, and as such must be mitigated. In this paper, we analyze, for the first time, several Indonesian NLP datasets to see whether they contain unwanted bias and the effects of debiasing on them. We find that two of the three data sets analyzed in this study contain unwanted bias, whose effects trickle down to downstream performance in the form of allocation and representation harm. The results of debiasing at the dataset level, as a response to the biases previously discovered, are consistently positive for the respective dataset. However, depending on the data set and embedding used to train the model, they vary greatly at the downstream performance level. In particular, the same debiasing technique can decrease bias on a combination of datasets and embedding, yet increase bias on another, particularly in the case of representation harm. |
---|---|
ISSN: | 2580-0760 2580-0760 |
DOI: | 10.29207/resti.v7i4.5035 |