KDPII: A New Korean Dialogic Dataset for the Deidentification of Personally Identifiable Information
Published in: IEEE Access, 2024, Vol. 12, p. 135626-135641
Format: Article
Language: English
Online access: Full text
Abstract: The rapid growth of social media in the era of big data and artificial intelligence has raised significant safety concerns related to the communication of sensitive personal information. In modern society, awareness of the importance of preserving privacy is growing, and there is rising advocacy for adopting language modeling technology to mitigate the risk of personal information leakage and to deidentify sensitive information depending on the situation. Thus far, several theoretical analyses of privacy protection in Korea have been conducted. However, the technical development of language model training resources for Korean has been slower than that for widely spoken languages such as English and Chinese. To address this problem, we developed a comprehensive and organized framework for classifying Korean personally identifiable information (PII) by investigating pertinent examples from within and outside Korea, such as the "Text Anonymization Benchmark" and the "Network Intrusion Detection Dataset." Subsequently, we created KDPII, a new Korean dataset for PII deidentification that consists of conversational texts incorporating abundant Korean PII. Based on this dataset, we examined the Korean PII processing performance of several representative language models available on the market. Finally, we found that although the models' performance in identifying PII varied by model size, architecture, and training source, most of them were significantly better at recognizing universal PII than language-specific PII, which points to expanding training data as a prospective direction for implementing Korean-specific PII deidentification in the future.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3461804
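
The abstract's comparison of universal versus language-specific PII recognition can be made concrete with an entity-level scoring sketch. The snippet below is a minimal illustration and is not code from the paper: the category names (EMAIL, PHONE, DATE, KR_RRN, KR_ADDRESS) and the toy spans are assumptions chosen for the example, and KDPII's actual label set may differ.

```python
# Minimal sketch (not from the paper): entity-level precision/recall/F1 for
# PII tagging, split into "universal" vs. "language-specific" categories.
# The label names and toy spans below are illustrative assumptions only.
from collections import defaultdict

UNIVERSAL = {"EMAIL", "PHONE", "DATE"}        # assumed cross-lingual PII types
LANGUAGE_SPECIFIC = {"KR_RRN", "KR_ADDRESS"}  # assumed Korean-specific PII types


def group_of(span):
    """Map a (start, end, label) span to its category group."""
    return "universal" if span[2] in UNIVERSAL else "language-specific"


def grouped_scores(examples):
    """examples: list of (gold_spans, pred_spans) per utterance,
    where each span set contains (start, end, label) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in examples:
        for span in gold & pred:
            counts[group_of(span)]["tp"] += 1
        for span in pred - gold:
            counts[group_of(span)]["fp"] += 1
        for span in gold - pred:
            counts[group_of(span)]["fn"] += 1

    results = {}
    for group, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results[group] = {"precision": p, "recall": r, "f1": f1}
    return results


# Toy example: the model finds the e-mail address but misses the
# Korean-specific identifier, so the universal group scores higher.
gold = {(0, 18, "EMAIL"), (25, 39, "KR_RRN")}
pred = {(0, 18, "EMAIL")}
print(grouped_scores([(gold, pred)]))
```

Scoring the two groups separately, as sketched here, is one way to surface the gap the abstract reports: a model can achieve high F1 on universal PII while scoring near zero on Korean-specific categories.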