Probabilistic Record Linkage (PRL) advances between Surveillance Data and HES datasets using the Expectation-Maximization (EM) algorithm: IJPDS (2017) Issue 1, Vol 1:350 Proceedings of the IPDLN Conference (August 2016)
Published in: | International journal of population data science 2017-04, Vol.1 (1) |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Summary: | ABSTRACT
Objectives: This study shows how to improve PRL when only a limited number of personally identifiable information (PII) fields are available in HES, such as NHS number, DOB, Hospital Number, Gender and Postcode. The focus is on new approaches to categorising m-u probabilities for fields related to Postcode. In addition, we briefly outline our procedure for generating an appropriate training dataset, the reasons for splitting the HES dataset into single and multi-admissions, and proposed techniques to avoid computational cost.
Approach: PIIs such as DOB and Postcode can be categorised to increase the number of potential comparison vectors, which is necessary to deal with the multi-million size of the HES dataset. We propose a multinomial approach for comparing postcodes in both agreement and disagreement scenarios, based on geographical information and a similarity score resulting from a Probit model. In the absence of a ‘gold-standard’ dataset, an appropriate training set can be generated by combining two-way Deterministic Linkage (DL) and simple random sampling. Splitting HES into single and multi-admissions is recommended for reasons related to computational performance and demographic differences in the underlying population. The lower and upper thresholds on linkage weights can be selected in a practical way by using conditional Normal mixture modelling, or more formally by using sampling techniques based on the Fellegi-Sunter decision rule. We show how the latter can be further improved, avoiding the computational cost, by introducing appropriate mapping functions.
Results: We show the advantages of PRL over DL when linking Surveillance Laboratory Flu Data and HES. Preliminary results indicate an additional 10% of matches for < 1% expected false positives. EM convergence problems that arise when missing values are treated as a separate outcome category will be discussed, comparing results with the re-normalisation approach suggested in the literature.
Conclusion: PRL between Laboratory Surveillance Data and Hospital Episode Statistics (HES) is common practice and is performed in a number of epidemiological studies conducted at Public Health England (PHE). Improving linkage will therefore benefit those studies. Furthermore, this study proposes new approaches for Postcode comparison and the use of mapping functions to avoid the computational cost of the Fellegi-Sunter decision rule. These appr |
---|---|
ISSN: | 2399-4908 |
DOI: | 10.23889/ijpds.v1i1.371 |
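
The abstract refers to estimating m-u probabilities with the EM algorithm and to classifying pairs with lower and upper thresholds on the linkage weight. As a rough illustration only, the sketch below runs a textbook Fellegi-Sunter EM on simulated binary agreement vectors. It is not the authors' implementation: the binary (agree/disagree) comparison, the conditional independence assumption, the four simulated fields, the starting values, and the threshold values are all assumptions made for this example, and the multinomial Postcode categorisation and mapping functions described in the abstract are not reproduced.

```python
import math
import random


def em_fellegi_sunter(gammas, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate the match prevalence p and per-field m/u probabilities.

    gammas: list of binary comparison vectors (1 = field agrees).
    Assumes fields are conditionally independent given match status.
    """
    n, k = len(gammas), len(gammas[0])
    m, u = [m_init] * k, [u_init] * k
    eps = 1e-9  # keeps probabilities away from 0 and 1
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for gamma in gammas:
            pm, pu = p, 1.0 - p
            for j, agree in enumerate(gamma):
                pm *= m[j] if agree else 1.0 - m[j]
                pu *= u[j] if agree else 1.0 - u[j]
            post.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the posterior weights.
        total = sum(post)
        p = total / n
        for j in range(k):
            agree_m = sum(g for g, gamma in zip(post, gammas) if gamma[j])
            agree_u = sum(1.0 - g for g, gamma in zip(post, gammas) if gamma[j])
            m[j] = min(max(agree_m / total, eps), 1.0 - eps)
            u[j] = min(max(agree_u / (n - total), eps), 1.0 - eps)
    return p, m, u


def match_weight(gamma, m, u):
    """Sum of log2 likelihood ratios over fields (the linkage weight)."""
    return sum(
        math.log2(m[j] / u[j]) if agree else math.log2((1.0 - m[j]) / (1.0 - u[j]))
        for j, agree in enumerate(gamma)
    )


def classify(weight, lower, upper):
    """Fellegi-Sunter style three-way decision using two thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Simulated comparison vectors; these parameters are invented for the demo.
random.seed(0)
TRUE_M = [0.95, 0.90, 0.90, 0.85]
TRUE_U = [0.05, 0.10, 0.50, 0.20]
pairs = []
for _ in range(2000):
    is_match = random.random() < 0.2
    probs = TRUE_M if is_match else TRUE_U
    pairs.append([1 if random.random() < q else 0 for q in probs])

p_hat, m_hat, u_hat = em_fellegi_sunter(pairs)

# Illustrative thresholds only; the abstract selects them via mixture
# modelling or sampling, which is not shown here.
LOWER, UPPER = -2.0, 5.0
decision = classify(match_weight([1, 1, 1, 1], m_hat, u_hat), LOWER, UPPER)
```

In this sketch a pair's weight rises with each agreeing field in proportion to how much more likely agreement is among matches than among non-matches, which is why a field with a high u-probability (here the third simulated field) contributes little when it agrees.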