Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records

Bibliographic details
Published in: International journal of medical informatics (Shannon, Ireland), 2002-12, Vol.67 (1), p.75-83
Main authors: Ruch, P, Baud, R, Geissbühler, A
Format: Article
Language: English
Description
Abstract: Unlike journal corpora, which are supposed to be carefully reviewed before being published, the quality of documents in a patient record is often corrupted by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term frequency–inverse document frequency (tf–idf) as weighting schema; we pay special attention to the normalization factor. First results show that even low corruption levels (3%) do affect retrieval effectiveness (4–7%), whereas higher corruption levels can affect retrieval effectiveness by 25%. Then, we show that the use of an improved automatic spelling correction system, applied on the corrupted collection, can almost restore the retrieval effectiveness of the engine.
ISSN: 1386-5056, 1872-8243
DOI: 10.1016/S1386-5056(02)00057-6
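
The setup summarized in the abstract (a bag of words index with tf-idf weights and a normalization factor, queried against a corrupted collection) can be illustrated with a small sketch. This is not the authors' system: the function names are hypothetical, stemming is omitted, the idf variant and cosine normalization are assumed choices, and the noise model (swapping two adjacent characters) is only one simple way to imitate misspellings.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokens; the paper indexes stems, which this sketch omits.
    return text.lower().split()


def corrupt(tokens, rate, rng):
    # Assumed noise model: swap two adjacent characters in roughly `rate` of the tokens.
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < rate:
            i = rng.randrange(len(tok) - 1)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out


def weigh(tokens, idf):
    # tf-idf weights followed by cosine (L2) normalization; unknown terms are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: f * idf[t] for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    # docs: list of token lists. Returns (idf table, list of unit-length vectors).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}  # assumed idf variant
    return idf, [weigh(doc, idf) for doc in docs]


def score(query_vec, doc_vec):
    # Dot product of two unit vectors, i.e. cosine similarity.
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


docs = [
    "patient reports chest pain",
    "fracture of the left femur",
    "chronic chest infection",
]
clean = [tokenize(d) for d in docs]

# Retrieval on the clean collection.
idf, vecs = build_index(clean)
query = weigh(tokenize("chest pain"), idf)
clean_scores = [score(query, v) for v in vecs]

# Retrieval on a fully corrupted copy of the collection (rate=1.0 for illustration).
rng = random.Random(0)
noisy = [corrupt(doc, 1.0, rng) for doc in clean]
noisy_idf, noisy_vecs = build_index(noisy)
noisy_query = weigh(tokenize("chest pain"), noisy_idf)
noisy_scores = [score(noisy_query, v) for v in noisy_vecs]
```

On the clean collection the first document ranks highest for the query; once every long token is misspelled, the query terms no longer match any index term, which is the failure mode the abstract attributes to corruption. Measuring the 4–7% and 25% effects reported in the abstract would need a real test collection and relevance judgments, which this sketch does not attempt.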