Supplementing domain knowledge to BERT with semi-structured information of documents
Published in: Expert Systems with Applications, 2024-01, Vol. 235, Article 121054
Format: Article
Language: English
Abstract: Domain adaptation is an effective way to boost BERT's performance on domain-specific natural language processing (NLP) tasks. Common domain adaptation methods, however, can be deficient in capturing domain knowledge. Meanwhile, the context fragmentation inherent in Transformer-based models also hinders the acquisition of domain knowledge. Considering the semi-structural characteristics of documents and their potential for alleviating these problems, we leverage the semi-structured information of documents to supplement BERT with domain knowledge. To this end, we propose a topic-based domain adaptation method, which enhances the capture of domain knowledge at multiple levels of text granularity. Specifically, topic masked language modeling is designed at the paragraph level for pre-training, and a topic subsection matching degree dataset is automatically constructed at the subsection level for intermediate fine-tuning. Experiments are conducted on four biomedical NLP tasks across six datasets. The results show that our method benefits BERT, RoBERTa, SpanBERT, BioBERT, and PubMedBERT in nearly all cases, with significant gains on two question answering (QA) tasks. On customer health QA, the most topic-related task, the average accuracy improves by 4.8%. Thus, the semi-structured information of documents can be exploited to help BERT capture domain knowledge more effectively.
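This record does not spell out the topic masked language modeling (TMLM) procedure, but a minimal sketch of what topic-guided masking at the paragraph level might look like is given below: tokens belonging to topic terms (e.g., drawn from subsection headings) are masked at a higher rate than ordinary tokens. The function name, masking rates, and the use of Hugging Face's BertTokenizerFast are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of topic-guided masking for MLM pre-training.
# Assumption: topic terms are masked more aggressively than other tokens;
# the rates (0.5 / 0.15) and helper name are hypothetical.
import random
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def topic_masked_inputs(paragraph, topic_terms, topic_rate=0.5, base_rate=0.15):
    """Mask tokens of topic terms at a higher rate than ordinary tokens."""
    topic_ids = {tid for term in topic_terms
                 for tid in tokenizer(term, add_special_tokens=False)["input_ids"]}
    input_ids = tokenizer(paragraph, truncation=True)["input_ids"]
    labels = [-100] * len(input_ids)          # -100 is ignored by the MLM loss
    for i, tok in enumerate(input_ids):
        if tok in tokenizer.all_special_ids:  # never mask [CLS]/[SEP]
            continue
        rate = topic_rate if tok in topic_ids else base_rate
        if random.random() < rate:
            labels[i] = tok                   # predict the original token here
            input_ids[i] = tokenizer.mask_token_id
    return input_ids, labels

ids, labels = topic_masked_inputs(
    "Aspirin inhibits platelet aggregation and reduces cardiovascular risk.",
    topic_terms=["platelet aggregation"])
```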
Highlights:
• Topic-based domain adaptation (TDA) is proposed for BERT.
• Topic masked language modeling (TMLM) is designed for pre-training.
• A topic subsection matching degree (TSMD) dataset is created for intermediate fine-tuning.
• The semi-structured information of documents is critical for domain adaptation of BERT.
• TDA helps BERT yield significant gains on the customer health QA task.
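The record only states that the TSMD dataset is "automatically constructed at the subsection level." One plausible construction, sketched below under stated assumptions, pairs each subsection with its true heading as a match and with a heading sampled from another subsection as a mismatch; the pairing scheme, labels, and function name are hypothetical, and the paper's actual procedure may differ.

```python
# Hedged sketch of automatic topic-subsection matching pair construction.
# Assumption: match = (true heading, subsection), mismatch = (other heading,
# subsection); requires at least two distinct headings.
import random

def build_tsmd_pairs(subsections, seed=0):
    """subsections: list of (heading, text) tuples."""
    rng = random.Random(seed)
    headings = [h for h, _ in subsections]
    pairs = []
    for heading, text in subsections:
        pairs.append({"topic": heading, "text": text, "label": 1})    # matched
        negative = rng.choice([h for h in headings if h != heading])  # other heading
        pairs.append({"topic": negative, "text": text, "label": 0})   # mismatched
    return pairs

pairs = build_tsmd_pairs([
    ("Dosage", "Take one tablet daily with food..."),
    ("Side effects", "Common side effects include nausea..."),
])
```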
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2023.121054