Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records
Unlike journal corpora, which are supposed to be carefully reviewed before being published, the quality of documents in a patient record are often corrupted by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of suc...
Gespeichert in:
Veröffentlicht in: | International journal of medical informatics (Shannon, Ireland) Ireland), 2002-12, Vol.67 (1), p.75-83 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Unlike journal corpora, which are supposed to be carefully reviewed before being published, the quality of documents in a patient record are often corrupted by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term frequency–inverse document frequency (tf–idf) as weighting schema; we pay special attention to the normalization factor. First results shows that even low corruption levels (3%) do affect retrieval effectiveness (4–7%), whereas higher corruption levels can affect retrieval effectiveness by 25%. Then, we show that the use of an improved automatic spelling correction system, applied on the corrupted collection, can almost restore the retrieval effectiveness of the engine. |
---|---|
ISSN: | 1386-5056 1872-8243 |
DOI: | 10.1016/S1386-5056(02)00057-6 |