NHash: Randomized N-Gram Hashing for Distributed Generation of Validatable Unique Study Identifiers in Multicenter Research

Bibliographic details
Published in:	JMIR medical informatics 2015-11, Vol.3 (4), p.e35-e35
Main authors:	Zhang, Guo-Qiang; Tao, Shiqiang; Xing, Guangming; Mozes, Jeno; Zonjy, Bilal; Lhatoo, Samden D; Cui, Licong
Format:	Article
Language:	English
Subjects:	Collaboration; Convulsions & seizures; Epilepsy; Internet service providers; Medical records; Neurological disorders; Ontology; Original Paper; Patients; Personal health
Online access:	Full text

Description
Abstract:	A unique study identifier serves as a key for linking research data about a study subject without revealing protected health information in the identifier. While sufficient for single-site and limited-scale studies, the use of common unique study identifiers has several drawbacks for large multicenter studies, where thousands of research participants may be recruited from multiple sites. An important property of study identifiers is error tolerance (validatability), in that inadvertent editing mistakes during their transmission and use will most likely result in invalid study identifiers. This paper introduces a novel method called "Randomized N-gram Hashing (NHash)" for generating unique study identifiers in a distributed and validatable fashion in multicenter research. NHash has a unique set of properties: (1) it is a pseudonym that links research data about a study participant for research purposes; (2) it can be generated automatically in a completely distributed fashion with virtually no risk of identifier collision; (3) it incorporates a set of cryptographic hash functions based on N-grams, combined with additional encryption techniques such as a shift cipher; (4) it is validatable (error tolerant) in the sense that inadvertent edit errors will mostly result in invalid identifiers. NHash consists of 2 phases. In the first phase, an intermediate string is generated using randomized N-gram hashing. This string consists of a collection of N-gram hashes f1, f2, ..., fk. The input for each function fi has 3 components: a random number r, an integer n, and input data m. The result, fi(r, n, m), is an n-gram of m with a starting position s, which is computed as (r mod |m|), where |m| represents the length of m. The output of Phase 1 is the concatenation of the sequence f1(r1, n1, m1), f2(r2, n2, m2), ..., fk(rk, nk, mk).
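The Phase 1 description can be illustrated with a minimal Python sketch. The function names `ngram` and `phase1` are hypothetical, and the sketch assumes that an n-gram running past the end of m wraps around to the beginning; the abstract does not say how that case is handled, nor what the input strings m actually contain.

```python
def ngram(r: int, n: int, m: str) -> str:
    """Return the n-gram of m starting at position s = r mod |m|.

    Assumption (not stated in the abstract): the n-gram wraps around
    to the start of m when it would run past the end.
    """
    if not m:
        raise ValueError("m must be non-empty")
    s = r % len(m)  # starting position as given in the abstract
    return "".join(m[(s + i) % len(m)] for i in range(n))


def phase1(triples) -> str:
    """Concatenate f_i(r_i, n_i, m_i) over a sequence of (r, n, m) triples."""
    return "".join(ngram(r, n, m) for r, n, m in triples)


# Illustrative inputs only; real inputs would be derived from study data.
print(ngram(7, 3, "abcde"))   # s = 7 mod 5 = 2, prints "cde"
print(phase1([(7, 3, "abcde"), (4, 3, "abcde")]))  # prints "cdeeab"
```

In this sketch the second call wraps around (s = 4 gives "eab"), which is exactly the case the wrap-around assumption covers.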
In the second phase, the intermediate string generated in Phase 1 is encrypted using techniques such as a shift cipher. The result of the encryption, concatenated with the random number r, is the final NHash study identifier. We performed experiments using a large synthesized dataset comparing NHash with random strings, and demonstrated a negligible probability of collision. We implemented NHash for the Center for SUDEP Research (CSR), a National Institute for Neurological Disorders and Stroke-funded Center Without Walls for Collaborative Research in the Epilepsies. This multicenter collaboration involves 14 institutions acro
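As a rough illustration of the second phase, the following Python sketch applies a shift cipher to an intermediate string and appends the random number r. Several details are assumptions made only for this example and are not given in the abstract: the identifier alphabet (digits plus lowercase letters), the `shift` parameter, the fixed-width decimal encoding of r, and the names `shift_cipher` and `phase2`.

```python
import string

# Assumed alphabet for the intermediate string (hypothetical choice).
ALPHABET = string.digits + string.ascii_lowercase


def shift_cipher(text: str, shift: int) -> str:
    """Shift every character by `shift` positions within ALPHABET, cyclically."""
    k = len(ALPHABET)
    # str.index raises ValueError for characters outside the alphabet.
    return "".join(ALPHABET[(ALPHABET.index(ch) + shift) % k] for ch in text)


def phase2(intermediate: str, r: int, shift: int, r_width: int = 4) -> str:
    """Encrypt the Phase 1 string and append r as zero-padded decimal digits.

    The fixed width r_width is an assumption; it lets a reader of the
    identifier split off r again without a separator.
    """
    if not 0 <= r < 10 ** r_width:
        raise ValueError("r does not fit in r_width digits")
    return shift_cipher(intermediate, shift) + f"{r:0{r_width}d}"


print(phase2("a9z", r=42, shift=3))  # prints "dc20042"
```

Because r travels with the identifier, a recipient who knows the shift can undo the cipher with `shift_cipher(text, -shift)` and recover the intermediate string.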
ISSN:	2291-9694
DOI:	10.2196/medinform.4959