Classify Alzheimer genes association using Naïve Bayes algorithm
Published in: | Human gene (Amsterdam, Netherlands), 2024-09, Vol.41, p.201309, Article 201309 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Alzheimer's disease, the most common form of dementia, accounts for 60–80% of cases, and its prevalence is projected to increase as populations age. By 2050, the number of individuals with Alzheimer's and dementia worldwide is expected to reach 152 million. Genetics plays a significant role, contributing about 70% of the overall risk, which underscores the importance of understanding the genetic basis for developing targeted interventions. This study presents a system that combines text mining and machine learning techniques to identify and prioritize prospective candidate genes for Alzheimer's, and further classifies them into three weighted association classes.
The machine learning-based classifier was trained on a curated gold standard dataset and validated with 10-fold cross-validation, showing consistent performance across all folds. The resulting ensemble learning system categorizes PubMed abstracts into three distinct groups (Yes, No, and Ambiguous) using text mining and a Bayesian classification algorithm. The trained classifier is then used to predict disease-gene associations in previously unseen, disease-specific data.
With an average accuracy of 87.33% and a confidence level of 90.10% ± 0.142, the protocol extracted 2031 associated genes, of which 1162, 489 and 1439 belong to the positive, negative and ambiguous classes respectively at a threshold of 0.9. Compared with established disease-gene databases, the system identified 915 positive genes that had not been previously reported. The positive genes can be used for in-depth study, and the ambiguous genes for further exploration of their association with Alzheimer's disease.
The system's ability to generate accurate predictions demonstrates its robustness and provides valuable insights into the genetic factors of Alzheimer's disease. Consequently, this study contributes to existing knowledge and paves the way for future research in this field.
• Developed an ML and text mining based Disease Gene Association Classifier (DGAC).
• DGAC trained and rigorously validated over a gold standard dataset using 10-fold cross-validation (10KCV).
• DGAC classifies sentences into the nuanced classes Positive, Negative and Ambiguous.
• With an accuracy of 87.33%, the protocol extracts 2031 associated genes.
• 915 previously unreported Alzheimer's candidate genes were identified. |
---|---|
ISSN: | 2773-0441 |
DOI: | 10.1016/j.humgen.2024.201309 |
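The abstract describes a Bayesian text classifier that sorts sentences into Yes, No and Ambiguous classes, is validated with 10-fold cross-validation, and keeps predictions at a threshold of 0.9. The following is a minimal sketch of that general idea, not the authors' DGAC system: it uses a plain multinomial Naive Bayes with add-one smoothing, assumes the 0.9 threshold is a cutoff on the posterior class probability, and runs on invented toy sentences with hypothetical gene names (GENE1 to GENE6). All function names are illustrative.

```python
import math
from collections import Counter, defaultdict

# Toy, invented training sentences; gene names are hypothetical placeholders.
TOY = [
    ("GENE1 is associated with Alzheimer risk", "Yes"),
    ("GENE2 mutation increases Alzheimer risk", "Yes"),
    ("GENE3 shows no association with Alzheimer", "No"),
    ("GENE4 is not linked to Alzheimer", "No"),
    ("GENE5 may possibly relate to Alzheimer", "Ambiguous"),
    ("GENE6 role in Alzheimer remains unclear", "Ambiguous"),
] * 5


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification of real text mining)."""
    return text.lower().split()


def train(docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab, len(docs)


def posterior(model, text):
    """Return P(class | text) for every class, using add-one smoothing."""
    class_counts, word_counts, vocab, n_docs = model
    log_scores = {}
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long texts.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / norm for label, s in log_scores.items()}


def predict(model, text, threshold=0.0):
    """Most probable class, or None if its posterior is below the threshold."""
    probs = posterior(model, text)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None


def cross_validate(docs, k=10):
    """k-fold cross-validation; returns the accuracy measured on each fold."""
    accuracies = []
    for i in range(k):
        held_out = docs[i::k]
        training = [d for j, d in enumerate(docs) if j % k != i]
        model = train(training)
        correct = sum(predict(model, text) == label for text, label in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


if __name__ == "__main__":
    model = train(TOY)
    print(predict(model, "GENE1 is associated with Alzheimer risk", threshold=0.9))
    print(cross_validate(TOY, k=10))
```

In this sketch a sentence whose best class falls below the cutoff is left unassigned rather than forced into a class, which is one plausible way to read "at the threshold of 0.9"; the paper itself should be consulted for how the threshold and class weights are actually applied.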