MCRSpell: A metric learning of correct representation for Chinese spelling correction

Chinese spelling correction (CSC) is a difficult but gratifying work that not only helps individuals read and understand the material in their daily lives but also serves as pre-processing for numerous natural language processing (NLP) applications. With pre-trained language models like BERT, many r...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Expert systems with applications 2024-03, Vol.237, p.121513, Article 121513
Hauptverfasser: Li, Chengzhang, Zhang, Ming, Zhang, Xuejun, Yan, Yonghong
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Chinese spelling correction (CSC) is a difficult but gratifying work that not only helps individuals read and understand the material in their daily lives but also serves as pre-processing for numerous natural language processing (NLP) applications. With pre-trained language models like BERT, many recent works have had great success. However, these methods have three limitations. Firstly, they rely on the output of the last layer of the correction model for parameter updates, ignoring the guidance of intermediate features. This can prevent the model from being adequately trained and thus lead to overfitting. Secondly, they use a variety of data augmentations, which depend on expert knowledge. Some specific augmentation rules only improve the performance on the target dataset, and for out-of-set data, the performance decreases instead. Thirdly, due to the nature of statistical models, they tend to transform low-frequency expressions into more common high-frequency expressions, although correct expressions should be preserved as much as possible for error correction tasks. In this work, we propose a novel general framework for CSC that takes advantage of metric learning to learn semantic knowledge adaptively from multiple intermediate features. Our approach only feeds the input text and target text of the parallel corpus to the model separately for the spelling error correction task without any data augmentation. Moreover, we designed a replication mechanism to keep those low-frequency correct expressions instead of over-correcting them. Our suggested solution can greatly outperform the SOTA methods in the CSC problem, according to extensive trials on benchmarks. We will release the source code for further use by the community.
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2023.121513