Detecting duplicate biological entities using Markov random field-based edit distance


Bibliographic details
Published in: Knowledge and information systems 2010-11, Vol. 25 (2), p. 371-387
Main authors: Song, Min; Rudniy, Alex
Format: Article
Language: English
Online access: Full text
Description
Summary: Detecting duplicate entities in biological data is an important research task. In this paper, we propose a novel and context-sensitive Markov random field-based edit distance (MRFED) for this task. We apply the Markov random field theory to the Needleman–Wunsch distance and combine MRFED with TFIDF, a token-based distance algorithm, resulting in SoftMRFED. We compare SoftMRFED with other distance algorithms such as Levenshtein, SoftTFIDF, and Monge–Elkan for two matching tasks: biological entity matching and synonym matching. The experimental results show that SoftMRFED significantly outperforms the other edit distance algorithms on several test data collections. In addition, the performance of SoftMRFED is superior to token-based distance algorithms in two matching tasks.
ISSN: 0219-1377, 0219-3116
DOI: 10.1007/s10115-009-0254-7