A method of inferring the relationship between Biomedical entities through correlation analysis on text

Bibliographic details
Published in:	Biomedical engineering online 2018-11, Vol.17 (Suppl 2), p.155-155, Article 155
Main authors:	Song, Hye-Jeong, Yoon, Byeong-Hun, Youn, Young-Shin, Park, Chan-Young, Kim, Jong-Dae, Kim, Yu-Seop
Format:	Article
Language:	English
Subjects:	Artificial intelligence; Bio-marker; Bioindicators; Bioinformatics; Biological markers; Biomarkers; Biomarkers - metabolism; Biomedical Research - methods; Cancer; Canonical Correlation Analysis (CCA); Correlation analysis; Disease; Diseases; Embedding; Estimation; Gene expression; Genomes; Language; Learning algorithms; Lexical similarity; Linguistics; Machine Learning; Methods; Microbiology; Microorganisms; Natural Language Processing; Physiological aspects; Proteins; Scientific papers; Search engines; Similarity; t-distributed stochastic neighbor embedding (t-SNE); Texts; Word embedding
Online access:	Full text

Description
Abstract:	Word representation is one of the most important steps in machine learning-based natural language processing. The commonly used one-hot representation produces very large vectors and assumes that the features making up the vector are independent of each other. Word embedding, by contrast, is known to be highly effective for estimating the similarity between words because it captures word meaning well. In this study, we use this ability to clarify the correlations between various terms in biomedical texts, applying word embedding to find new biomarkers and microorganisms related to specific diseases. We analyze the correlations between disease-marker pairs and disease-microorganism pairs. First, we construct a corpus of related text by extracting titles and abstracts from biomedical articles on PubMed. Second, we represent the disease, marker, and microorganism terms as word embeddings using Canonical Correlation Analysis (CCA), a statistical method that performs very well at vector dimension reduction. Finally, we estimate the relationships between disease-marker pairs and disease-microorganism pairs by measuring their similarity. In the experiment, we checked the correlations derived through word embedding against Google Scholar search results. Of the 20 most highly correlated disease-marker pairs, Google Scholar searches showed that about 85% have in fact been studied extensively. Conversely, for 85% of the 20 pairs with the lowest correlation, we could not find any study examining the relationship between the disease and the marker. The trend was similar for disease-microbe pairs.
The correlations between diseases and markers, and between diseases and microorganisms, calculated through word embedding reflect actual research trends. If the word-embedding correlation for a pair is high but few studies have been published on it, the pair can be proposed as a candidate for additional research.
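
The final step described in the abstract, scoring disease-marker pairs by the similarity of their embedding vectors, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the term names, vector values, and helper functions (`cosine`, `one_hot`, `rank_pairs`) are all hypothetical. The sketch also shows why one-hot vectors are unsuitable for this task: any two distinct one-hot vectors are orthogonal, so their similarity is always zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """One-hot vector: a single 1.0 at `index`, zeros elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def rank_pairs(diseases, others, vectors):
    """Return (disease, other, similarity) triples, most similar first."""
    scored = [
        (d, o, cosine(vectors[d], vectors[o]))
        for d in diseases
        for o in others
    ]
    return sorted(scored, key=lambda triple: triple[2], reverse=True)


# Hypothetical 3-dimensional embeddings; the values are made up for
# illustration and do not come from the paper.
toy_vectors = {
    "disease_a": [0.9, 0.1, 0.0],
    "disease_b": [0.0, 0.2, 0.9],
    "marker_x": [0.8, 0.2, 0.1],
    "marker_y": [0.1, 0.1, 0.8],
}

# Distinct one-hot vectors share no non-zero position, so the dot product
# (and therefore the cosine similarity) is zero.
print(cosine(one_hot(0, 4), one_hot(2, 4)))

# Dense embeddings give graded scores that can be ranked.
ranked = rank_pairs(
    ["disease_a", "disease_b"], ["marker_x", "marker_y"], toy_vectors
)
for disease, marker, score in ranked:
    print(f"{disease} - {marker}: {score:.3f}")
```

In a workflow like the one the abstract outlines, the top and bottom of such a ranking would then be compared against literature search results; pairs that score high but have little published work would be the candidates for further study.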
ISSN:	1475-925X
DOI:	10.1186/s12938-018-0583-4