Analysis and Mitigation of Religion Bias in Indonesian Natural Language Processing Datasets

Bibliographic details
Published in:	Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) (Online) 2023-08, Vol.7 (4), p.845-857
Main authors:	Fauzan, Muhammad Arief, Saptawijaya, Ari
Format:	Article
Language:	English
Subjects:	debiasing; Indonesian NLP; natural language processing; social bias
Online access:	Full text

Description
Abstract:	Previous studies have shown that various religious identities are misrepresented in Indonesian media. Misrepresentations of other marginalized identities in natural language processing (NLP) datasets have been documented to harm those identities in applications such as automated content moderation, and must therefore be mitigated. In this paper, we analyze, for the first time, several Indonesian NLP datasets to determine whether they contain unwanted bias and to examine the effects of debiasing on them. We find that two of the three datasets analyzed in this study contain unwanted bias, whose effects trickle down to downstream performance in the form of allocation and representation harm. Debiasing at the dataset level, applied in response to the biases discovered, yields consistently positive results for the respective dataset. However, its effects on downstream performance vary greatly depending on the dataset and embedding used to train the model. In particular, the same debiasing technique can decrease bias for one combination of dataset and embedding yet increase it for another, particularly in the case of representation harm.
ISSN:	2580-0760
DOI:	10.29207/resti.v7i4.5035