Linkability measures to assess the data characteristics for record linkage

Abstract Objectives Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low quality or unfit data can waste institutional resources on poor linkage results. We ai...

Bibliographic details
Published in: Journal of the American Medical Informatics Association : JAMIA 2024-11, Vol.31 (11), p.2651-2659
Main authors: Ong, Toan C, Hill, Andrew, Kahn, Michael G, Lembcke, Lauren R, Schilling, Lisa M, Grannis, Shaun J
Format: Article
Language: eng
Online access: Full text
container_end_page 2659
container_issue 11
container_start_page 2651
container_title Journal of the American Medical Informatics Association : JAMIA
container_volume 31
creator Ong, Toan C
Hill, Andrew
Kahn, Michael G
Lembcke, Lauren R
Schilling, Lisa M
Grannis, Shaun J
description Abstract. Objectives: Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low-quality or unfit data can waste institutional resources on poor linkage results. We aim to evaluate data linkability to enhance the effectiveness of record linkage. Materials and Methods: We describe a systematic approach using data fitness (“linkability”) measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies. Results: We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20 680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number varies widely in completeness among all sites. We investigated anomalous linkability values to identify the cause of many records having identical values in certain LVs, issues with placeholder values disguising data missingness, and orphan records. Discussion: The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. These results highlight the need for awareness of placeholder values, which informs the selection of variables and methods to optimize RL performance. Conclusion: Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.
doi_str_mv 10.1093/jamia/ocae248
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_3107162532</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/jamia/ocae248</oup_id><sourcerecordid>3107162532</sourcerecordid><originalsourceid>FETCH-LOGICAL-c251t-3cb5444ee1549cad2e031dd2c2b37e43dfa06cb9588dbe3584319f93a16b6f403</originalsourceid><addsrcrecordid>eNqFkDtPwzAURi0EoqUwsiKPLKF-5jGiivJQJRaQ2KIb-4a6JHWxk6H_npQWGJn8DUfnyoeQS85uOCvkdAWtg6k3gELlR2TMtciSIlNvx8NmaZZoJrIROYtxxRhPhdSnZCQLOWzJxuRp4dYfULnGdVvaIsQ-YKSdpxAjxmEtkVrogJolBDAdBhc7ZyKtfaABjQ-WNjvFO56TkxqaiBeHd0Je53cvs4dk8Xz_OLtdJEZo3iXSVFophci1KgxYgUxya4URlcxQSVsDS01V6Dy3FUqdK8mLupDA0yqtFZMTcr33boL_7DF2ZeuiwaaBNfo-lpKzbPiolmJAkz1qgo8xYF1ugmshbEvOyl2-8jtfecg38FcHdV-1aH_pn15_t32_-cf1BR6pe18</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3107162532</pqid></control><display><type>article</type><title>Linkability measures to assess the data characteristics for record linkage</title><source>Oxford University Press Journals All Titles (1996-Current)</source><creator>Ong, Toan C ; Hill, Andrew ; Kahn, Michael G ; Lembcke, Lauren R ; Schilling, Lisa M ; Grannis, Shaun J</creator><creatorcontrib>Ong, Toan C ; Hill, Andrew ; Kahn, Michael G ; Lembcke, Lauren R ; Schilling, Lisa M ; Grannis, Shaun J</creatorcontrib><description>Abstract Objectives Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low quality or unfit data can waste institutional resources on poor linkage results. We aim to evaluate data linkability to enhance the effectiveness of record linkage. 
Materials and Methods We describe a systematic approach using data fitness (“linkability”) measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies. Result We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20 680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number vary widely in completeness among all sites. We investigated anomalous linkability values to identify the cause of many records having identical values in certain LVs, issues with placeholder values disguising data missingness, and orphan records. Discussion The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. These results highlight the need for awareness of placeholder values, which inform the selection of variables and methods to optimize RL performance. Conclusion Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.</description><identifier>ISSN: 1067-5027</identifier><identifier>ISSN: 1527-974X</identifier><identifier>EISSN: 1527-974X</identifier><identifier>DOI: 10.1093/jamia/ocae248</identifier><identifier>PMID: 39301630</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><ispartof>Journal of the American Medical Informatics Association : JAMIA, 2024-11, Vol.31 (11), p.2651-2659</ispartof><rights>The Author(s) 2024. 
Published by Oxford University Press on behalf of the American Medical Informatics Association. 2024</rights><rights>The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c251t-3cb5444ee1549cad2e031dd2c2b37e43dfa06cb9588dbe3584319f93a16b6f403</cites><orcidid>0000-0002-8093-6639 ; 0000-0003-4786-6875 ; 0000-0001-6787-1407</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>315,781,785,1585,27929,27930</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/39301630$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ong, Toan C</creatorcontrib><creatorcontrib>Hill, Andrew</creatorcontrib><creatorcontrib>Kahn, Michael G</creatorcontrib><creatorcontrib>Lembcke, Lauren R</creatorcontrib><creatorcontrib>Schilling, Lisa M</creatorcontrib><creatorcontrib>Grannis, Shaun J</creatorcontrib><title>Linkability measures to assess the data characteristics for record linkage</title><title>Journal of the American Medical Informatics Association : JAMIA</title><addtitle>J Am Med Inform Assoc</addtitle><description>Abstract Objectives Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low quality or unfit data can waste institutional resources on poor linkage results. We aim to evaluate data linkability to enhance the effectiveness of record linkage. 
Materials and Methods We describe a systematic approach using data fitness (“linkability”) measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies. Result We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20 680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number vary widely in completeness among all sites. We investigated anomalous linkability values to identify the cause of many records having identical values in certain LVs, issues with placeholder values disguising data missingness, and orphan records. Discussion The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. These results highlight the need for awareness of placeholder values, which inform the selection of variables and methods to optimize RL performance. 
Conclusion Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.</description><issn>1067-5027</issn><issn>1527-974X</issn><issn>1527-974X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><recordid>eNqFkDtPwzAURi0EoqUwsiKPLKF-5jGiivJQJRaQ2KIb-4a6JHWxk6H_npQWGJn8DUfnyoeQS85uOCvkdAWtg6k3gELlR2TMtciSIlNvx8NmaZZoJrIROYtxxRhPhdSnZCQLOWzJxuRp4dYfULnGdVvaIsQ-YKSdpxAjxmEtkVrogJolBDAdBhc7ZyKtfaABjQ-WNjvFO56TkxqaiBeHd0Je53cvs4dk8Xz_OLtdJEZo3iXSVFophci1KgxYgUxya4URlcxQSVsDS01V6Dy3FUqdK8mLupDA0yqtFZMTcr33boL_7DF2ZeuiwaaBNfo-lpKzbPiolmJAkz1qgo8xYF1ugmshbEvOyl2-8jtfecg38FcHdV-1aH_pn15_t32_-cf1BR6pe18</recordid><startdate>20241101</startdate><enddate>20241101</enddate><creator>Ong, Toan C</creator><creator>Hill, Andrew</creator><creator>Kahn, Michael G</creator><creator>Lembcke, Lauren R</creator><creator>Schilling, Lisa M</creator><creator>Grannis, Shaun J</creator><general>Oxford University Press</general><scope>TOX</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-8093-6639</orcidid><orcidid>https://orcid.org/0000-0003-4786-6875</orcidid><orcidid>https://orcid.org/0000-0001-6787-1407</orcidid></search><sort><creationdate>20241101</creationdate><title>Linkability measures to assess the data characteristics for record linkage</title><author>Ong, Toan C ; Hill, Andrew ; Kahn, Michael G ; Lembcke, Lauren R ; Schilling, Lisa M ; Grannis, Shaun 
J</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c251t-3cb5444ee1549cad2e031dd2c2b37e43dfa06cb9588dbe3584319f93a16b6f403</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ong, Toan C</creatorcontrib><creatorcontrib>Hill, Andrew</creatorcontrib><creatorcontrib>Kahn, Michael G</creatorcontrib><creatorcontrib>Lembcke, Lauren R</creatorcontrib><creatorcontrib>Schilling, Lisa M</creatorcontrib><creatorcontrib>Grannis, Shaun J</creatorcontrib><collection>Access via Oxford University Press (Open Access Collection)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Journal of the American Medical Informatics Association : JAMIA</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ong, Toan C</au><au>Hill, Andrew</au><au>Kahn, Michael G</au><au>Lembcke, Lauren R</au><au>Schilling, Lisa M</au><au>Grannis, Shaun J</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Linkability measures to assess the data characteristics for record linkage</atitle><jtitle>Journal of the American Medical Informatics Association : JAMIA</jtitle><addtitle>J Am Med Inform Assoc</addtitle><date>2024-11-01</date><risdate>2024</risdate><volume>31</volume><issue>11</issue><spage>2651</spage><epage>2659</epage><pages>2651-2659</pages><issn>1067-5027</issn><issn>1527-974X</issn><eissn>1527-974X</eissn><abstract>Abstract Objectives Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low quality or unfit data can waste institutional resources on poor linkage results. 
We aim to evaluate data linkability to enhance the effectiveness of record linkage. Materials and Methods We describe a systematic approach using data fitness (“linkability”) measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies. Result We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20 680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number vary widely in completeness among all sites. We investigated anomalous linkability values to identify the cause of many records having identical values in certain LVs, issues with placeholder values disguising data missingness, and orphan records. Discussion The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. These results highlight the need for awareness of placeholder values, which inform the selection of variables and methods to optimize RL performance. Conclusion Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>39301630</pmid><doi>10.1093/jamia/ocae248</doi><tpages>9</tpages><orcidid>https://orcid.org/0000-0002-8093-6639</orcidid><orcidid>https://orcid.org/0000-0003-4786-6875</orcidid><orcidid>https://orcid.org/0000-0001-6787-1407</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1067-5027
ispartof Journal of the American Medical Informatics Association : JAMIA, 2024-11, Vol.31 (11), p.2651-2659
issn 1067-5027
1527-974X
1527-974X
language eng
recordid cdi_proquest_miscellaneous_3107162532
source Oxford University Press Journals All Titles (1996-Current)
title Linkability measures to assess the data characteristics for record linkage
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-11T16%3A47%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Linkability%20measures%20to%20assess%20the%20data%20characteristics%20for%20record%20linkage&rft.jtitle=Journal%20of%20the%20American%20Medical%20Informatics%20Association%20:%20JAMIA&rft.au=Ong,%20Toan%20C&rft.date=2024-11-01&rft.volume=31&rft.issue=11&rft.spage=2651&rft.epage=2659&rft.pages=2651-2659&rft.issn=1067-5027&rft.eissn=1527-974X&rft_id=info:doi/10.1093/jamia/ocae248&rft_dat=%3Cproquest_cross%3E3107162532%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3107162532&rft_id=info:pmid/39301630&rft_oup_id=10.1093/jamia/ocae248&rfr_iscdi=true
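
The description field above defines linkability measures as metrics of a variable's availability, discriminatory power, and distribution, and notes that placeholder values can disguise missing data. The record does not list the ten metrics the authors computed, so the following is only a minimal sketch of what such per-variable metrics might look like. The function name `linkability_metrics`, the four metric names, and the `PLACEHOLDERS` set are all hypothetical choices made for illustration; the isolation forest step described in the abstract is not reproduced here.

```python
from collections import Counter
from math import log2

# Hypothetical placeholder values (an assumption for this sketch); a real list
# would come from manual review of each site's data.
PLACEHOLDERS = {"", "unknown", "n/a", "000-00-0000", "999-99-9999"}


def linkability_metrics(values):
    """Return illustrative availability and discriminatory-power metrics
    for one candidate linkage variable, given its values as strings or None."""
    total = len(values)
    if total == 0:
        raise ValueError("values must not be empty")

    # Count None and placeholder strings as missing, so that placeholders
    # do not inflate the apparent completeness of the variable.
    present = [
        v for v in values
        if v is not None and v.strip().lower() not in PLACEHOLDERS
    ]
    n = len(present)
    counts = Counter(present)

    # Shannon entropy (in bits) of the observed value distribution.
    entropy = -sum((c / n) * log2(c / n) for c in counts.values()) if n else 0.0

    return {
        "missing_rate": (total - n) / total,                       # availability
        "distinct_ratio": len(counts) / n if n else 0.0,           # uniqueness
        "top_value_share": max(counts.values()) / n if n else 0.0, # dominant value
        "entropy_bits": entropy,                                   # distribution
    }


if __name__ == "__main__":
    # Made-up example values for a Social Security Number column.
    ssn_column = ["123-45-6789", "000-00-0000", None,
                  "987-65-4321", "123-45-6789", "unknown"]
    print(linkability_metrics(ssn_column))
```

In a workflow like the one the abstract describes, a table of such metrics per site and variable would be the input to an anomaly detector; a high `top_value_share` or a low `distinct_ratio` is the kind of signal that might point to many records sharing an identical value.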