CoRTEx: Contrastive Learning for Representing Terms via Explanations with Applications on Constructing Biomedical Knowledge Graphs
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Objective: Biomedical Knowledge Graphs play a pivotal role in various
biomedical research domains. Concurrently, term clustering emerges as a crucial
step in constructing these knowledge graphs, aiming to identify synonymous
terms. Lacking background knowledge, previous contrastive learning models trained
on Unified Medical Language System (UMLS) synonyms struggle to cluster
difficult terms and do not generalize well beyond UMLS terms. In this work, we
leverage the world knowledge of Large Language Models (LLMs) and propose
Contrastive Learning for Representing Terms via Explanations (CoRTEx) to
enhance term representation and significantly improve term clustering.
Materials and Methods: The model training involves generating explanations for
a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning,
considering term and explanation embeddings simultaneously, and progressively
introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH
algorithm is designed for efficient clustering of a new ontology. Results: We
established a clustering test set and a hard negative test set, on both of which
our model consistently achieves the highest F1 score. With CoRTEx embeddings and the
modified BIRCH algorithm, we grouped 35,580,932 terms from the Biomedical
Informatics Ontology System (BIOS) into 22,104,559 clusters with O(N) queries
to ChatGPT. Case studies highlight the model's efficacy in handling challenging
samples, aided by information from explanations. Conclusion: By aligning terms
to their explanations, CoRTEx demonstrates superior accuracy over benchmark
models and robustness beyond its training set, and it is suitable for
clustering terms for large-scale biomedical ontologies. |
DOI: | 10.48550/arxiv.2312.08036 |
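
The summary describes contrastive learning over term and explanation embeddings but does not state the loss function. As a minimal sketch, assuming an InfoNCE-style objective with in-batch negatives (a common choice, not confirmed by this record), the idea of pulling each term toward its own explanation can be illustrated as follows. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(term_embs, expl_embs, temperature=0.05):
    """Mean InfoNCE loss; term_embs[i] and expl_embs[i] form a positive pair.

    Every other explanation in the batch acts as a negative for term i.
    The temperature is an assumed value, not taken from the paper.
    """
    n = len(term_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(term_embs[i], expl_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching explanation.
        total += log_denom - logits[i]
    return total / n


# Toy 2-d embeddings: two terms and two explanation vectors.
terms = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each explanation near its own term
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each explanation near the other term

loss_aligned = info_nce(terms, aligned)
loss_swapped = info_nce(terms, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each term sits close to its own explanation and large when the pairing is swapped. Hard negatives, which the summary says are introduced progressively, would correspond to adding near-synonymous but distinct terms to the candidate set.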
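
The summary reports clustering with O(N) queries to ChatGPT but gives no algorithmic detail. The sketch below is not the authors' modified BIRCH algorithm; it is a simplified single-pass, nearest-centroid clustering that issues at most one oracle query per incoming term, which is one way such a linear query bound can arise. The oracle is a hypothetical stand-in function rather than a ChatGPT call, and the threshold, terms, and embeddings are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def cluster_stream(items, same_concept, threshold=0.8):
    """Single-pass clustering with at most one oracle query per term.

    items: list of (term, embedding) pairs.
    same_concept(term, representative) -> bool is the (assumed) oracle.
    A linear scan finds the nearest centroid; BIRCH would use a CF tree.
    """
    clusters = []
    queries = 0
    for term, emb in items:
        best, best_sim = None, -1.0
        for c in clusters:
            sim = cosine(emb, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        # Only ask the oracle when the embedding is already close enough.
        if best is not None and best_sim >= threshold:
            queries += 1
            if same_concept(term, best["rep"]):
                n = len(best["members"])
                # Update the running mean of the member embeddings.
                best["centroid"] = [(x * n + y) / (n + 1)
                                    for x, y in zip(best["centroid"], emb)]
                best["members"].append(term)
                continue
        clusters.append({"rep": term, "centroid": list(emb),
                         "members": [term]})
    return clusters, queries


# Toy data: "cardiac arrest" is close in embedding space but not a synonym.
items = [
    ("heart attack", [1.0, 0.0]),
    ("myocardial infarction", [0.95, 0.05]),
    ("cardiac arrest", [0.9, 0.3]),
    ("asthma", [0.0, 1.0]),
]
synonyms = {frozenset({"heart attack", "myocardial infarction"})}


def toy_oracle(a, b):
    """Hypothetical stand-in for an LLM synonymy judgment."""
    return frozenset({a, b}) in synonyms


clusters, queries = cluster_stream(items, toy_oracle)
print([c["members"] for c in clusters], queries)
```

In this toy run the oracle is consulted only for the two terms whose embeddings fall near an existing cluster, so the number of queries never exceeds the number of terms.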