LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition

Bibliographic Details
Published in: Neural Processing Letters, 2022-06, Vol. 54 (3), p. 2433-2454
Main Authors: Liu, Mingyi; Tu, Zhiying; Zhang, Tong; Su, Tonghua; Xu, Xiaofei; Wang, Zhongjie
Format: Article
Language: English
Online Access: Full Text
Description
Abstract: In recent years, deep learning has achieved great success in many natural language processing tasks, including named entity recognition. Its shortcoming is that a large quantity of manually annotated data is usually required. Previous studies have demonstrated that active learning can considerably reduce the cost of data annotation, but there is still plenty of room for improvement. In real applications, we found that existing uncertainty-based active learning strategies have two shortcomings. First, these strategies prefer, explicitly or implicitly, to choose long sequences, which increases the annotation burden on annotators. Second, some strategies require revising the model to generate additional information for sample selection, which increases the developer's workload as well as the model's training and prediction time. In this paper, we first examine traditional active learning strategies on several typical datasets in the specific cases of Word2Vec-BiLSTM-CRF and BERT-CRF, which have been widely used in named entity recognition. Then, we propose an uncertainty-based active learning strategy called lowest token probability (LTP), which combines the input and output of the conditional random field (CRF) to select informative instances. LTP is a simple and powerful strategy that does not favor long sequences and does not require revising the model. We test LTP on multiple real-world datasets; the experimental results show that, compared with existing state-of-the-art selection strategies, LTP reduces the number of annotated tokens by about 20% while maintaining competitive performance in both sentence-level accuracy and entity-level F1-score. Additionally, LTP significantly outperforms all other strategies in selecting valid samples, which dramatically reduces the number of invalid annotations required of the labelers.
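
To make the selection criterion concrete, the following is a minimal sketch (not the authors' code) of how an LTP-style ranking could be computed, assuming each unlabeled sentence comes with the CRF's per-token probabilities for its predicted tag sequence; the function names and toy data are hypothetical illustrations based on the abstract's description.

import numpy as np

def ltp_score(token_probs):
    # LTP score of one sentence: the probability of its least confident
    # token under the CRF's predicted tag sequence. Lower = more informative.
    return float(np.min(token_probs))

def select_for_annotation(pool_token_probs, budget):
    # Rank unlabeled sentences by LTP (ascending) and return the indices
    # of the `budget` lowest-scoring ones for annotation.
    scores = np.array([ltp_score(p) for p in pool_token_probs])
    return np.argsort(scores)[:budget].tolist()

# Toy pool of three sentences of different lengths; the long second
# sentence is not favored merely for its length, because only its single
# worst token determines the score.
pool = [
    np.array([0.99, 0.97, 0.95]),
    np.array([0.99, 0.40, 0.98, 0.97, 0.96, 0.99]),
    np.array([0.80, 0.85]),
]
print(select_for_annotation(pool, budget=2))  # -> [1, 2]

Because the score depends only on the single least confident token, sentence length does not directly inflate it, which is consistent with the abstract's claim that LTP does not favor long sequences.
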
ISSN: 1370-4621, 1573-773X
DOI: 10.1007/s11063-021-10737-x