Enhancing pre-trained language models with Chinese character morphological knowledge



Bibliographic Details
Published in: Information Processing & Management, 2025-01, Vol. 62 (1), p. 103945, Article 103945
Authors: Zheng, Zhenzhong, Wu, Xiaoming, Liu, Xiangzhi
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: Pre-trained language models (PLMs) have demonstrated success in Chinese natural language processing (NLP) tasks by acquiring high-quality representations through contextual learning. However, these models tend to neglect the glyph features of Chinese characters, which contain valuable semantic knowledge. To address this issue, this paper introduces a self-supervised learning strategy, named SGBERT, that aims to learn high-quality semantic knowledge from Chinese character morphology to enhance PLMs' understanding of natural language. Specifically, the learning process of SGBERT can be divided into two stages. In the first stage, we preheat the glyph encoder by constructing contrastive learning between glyphs, enabling it to obtain preliminary glyph coding capabilities. In the second stage, we transform the glyph features captured by the glyph encoder into context-sensitive representations through a glyph-aware window. These representations are then contrasted with the character representations generated by the PLMs, leveraging the powerful representation capabilities of the PLMs to guide glyph learning. Finally, the glyph knowledge is fused with the pre-trained model representations to obtain semantically richer representations. We conduct experiments on ten datasets covering six Chinese NLP tasks, and the results demonstrate that SGBERT significantly enhances commonly used Chinese PLMs. On average, the introduction of SGBERT yields a performance improvement of 1.36% for BERT and 1.09% for RoBERTa.
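The abstract describes a two-stage contrastive procedure: glyph-vs-glyph contrastive warm-up of a glyph encoder, followed by contrasting windowed (context-sensitive) glyph representations against the PLM's character representations and fusing the two. The sketch below is only an illustrative reconstruction of that outline; the encoder architecture, the InfoNCE loss form, the window operator, the additive fusion, and all names and dimensions are assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the two-stage scheme outlined in the abstract.
# All modules, shapes, and loss choices here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(anchor, positive, temperature=0.07):
    """Contrast each anchor with its paired positive against the rest of the batch."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature               # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


class GlyphEncoder(nn.Module):
    """Toy CNN over rendered glyph bitmaps (assumed 32x32 grayscale)."""
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, glyph_images):                            # (B, 1, 32, 32)
        return self.proj(self.conv(glyph_images).flatten(1))    # (B, dim)


class GlyphAwareWindow(nn.Module):
    """Mix each character's glyph vector with its neighbours (assumed 1-D convolution)."""
    def __init__(self, dim=768, window=3):
        super().__init__()
        self.mix = nn.Conv1d(dim, dim, kernel_size=window, padding=window // 2)

    def forward(self, glyph_seq):                               # (B, L, dim)
        return self.mix(glyph_seq.transpose(1, 2)).transpose(1, 2)


def stage1_loss(encoder, view_a, view_b):
    """Stage 1: warm up the glyph encoder with glyph-vs-glyph contrastive learning,
    e.g. two augmented renderings of the same character (augmentation is assumed)."""
    return info_nce(encoder(view_a), encoder(view_b))


def stage2_loss_and_fusion(encoder, window, plm_hidden, glyph_images):
    """Stage 2: contrast context-sensitive glyph representations with the PLM's
    character representations, then fuse the two (additive fusion is an assumption)."""
    B, L = plm_hidden.size(0), plm_hidden.size(1)
    glyph_seq = encoder(glyph_images.flatten(0, 1)).view(B, L, -1)
    glyph_ctx = window(glyph_seq)                               # (B, L, dim)
    loss = info_nce(glyph_ctx.flatten(0, 1), plm_hidden.flatten(0, 1))
    fused = plm_hidden + glyph_ctx                              # semantically enriched output
    return loss, fused
```

In this reading, the PLM representations serve as the contrastive targets in stage 2, which is one plausible way the paper's "PLM-guided glyph learning" could be realized; the actual fusion mechanism and training schedule would be specified in the full article.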
ISSN: 0306-4573
DOI: 10.1016/j.ipm.2024.103945