Developing a healthcare dataset information resource (DIR) based on Semantic Web

The right dataset is essential to obtain the right insights in data science; therefore, it is important for data scientists to have a good understanding of the availability of relevant datasets as well as the content, structure, and existing analyses of these datasets. While a number of efforts are...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:BMC medical genomics 2018-11, Vol.11 (Suppl 5), p.102-102, Article 102
Hauptverfasser: Shi, Jingyi, Zheng, Mingna, Yao, Lixia, Ge, Yaorong
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The right dataset is essential to obtain the right insights in data science; therefore, it is important for data scientists to have a good understanding of the availability of relevant datasets as well as the content, structure, and existing analyses of these datasets. While a number of efforts are underway to integrate the large amount and variety of datasets, the lack of an information resource that focuses on specific needs of target users of datasets has existed as a problem for years. To address this gap, we have developed a Dataset Information Resource (DIR), using a user-oriented approach, which gathers relevant dataset knowledge for specific user types. In the present version, we specifically address the challenges of entry-level data scientists in learning to identify, understand, and analyze major datasets in healthcare. We emphasize that the DIR does not contain actual data from the datasets but aims to provide comprehensive knowledge about the datasets and their analyses. The DIR leverages Semantic Web technologies and the W3C Dataset Description Profile as the standard for knowledge integration and representation. To extract tailored knowledge for target users, we have developed methods for manual extractions from dataset documentations as well as semi-automatic extractions from related publications, using natural language processing (NLP)-based approaches. A semantic query component is available for knowledge retrieval, and a parameterized question-answering functionality is provided to facilitate the ease of search. The DIR prototype is composed of four major components-dataset metadata and related knowledge, search modules, question answering for frequently-asked questions, and blogs. The current implementation includes information on 12 commonly used large and complex healthcare datasets. The initial usage evaluation based on health informatics novices indicates that the DIR is helpful and beginner-friendly. We have developed a novel user-oriented DIR that provides dataset knowledge specialized for target user groups. Knowledge about datasets is effectively represented in the Semantic Web. At this initial stage, the DIR has already been able to provide sophisticated and relevant knowledge of 12 datasets to help entry health informacians learn healthcare data analysis using suitable datasets. Further development of both content and function levels is underway.
ISSN:1755-8794
1755-8794
DOI:10.1186/s12920-018-0411-5