Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women

Genome-Wide Association Studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). GWAS typically use a p-value threshold of...

Bibliographic details
Published in: IEEE/ACM transactions on computational biology and bioinformatics, 2020-03, Vol. 17 (2), p. 668-678
Main authors: Fergus, Paul; Montanez, Casimiro Curbelo; Abdulaimma, Basma; Lisboa, Paulo; Chalmers, Carl; Pineles, Beth
Format: Article
Language: eng
Online access: Order full text
container_end_page 678
container_issue 2
container_start_page 668
container_title IEEE/ACM transactions on computational biology and bioinformatics
container_volume 17
creator Fergus, Paul
Montanez, Casimiro Curbelo
Abdulaimma, Basma
Lisboa, Paulo
Chalmers, Carl
Pineles, Beth
description Genome-Wide Association Studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). GWAS typically use a p-value threshold of 5 × 10^{-8} to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptible SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using lower p-values will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity is increased exponentially as a function of higher-order combinations, making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is a fundamental component for robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births, which has a strong genetic component with unexplained heritability reportedly between 20-40 percent. A GWAS data set, obtained from dbGaP, is utilised, which contains predominantly urban low-income African-American women who had normal and preterm deliveries.
Epistatic interactions from original SNP sequences were extracted through a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm birth observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasi
doi_str_mv 10.1109/TCBB.2018.2868667
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2386053405</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8454302</ieee_id><sourcerecordid>2100327593</sourcerecordid><originalsourceid>FETCH-LOGICAL-c392t-831b52f33d59ef732ffffb8116866541b12923687baf21e1fe991fc77363e4c53</originalsourceid><addsrcrecordid>eNpdkd9uFCEUxonR2Fp9AGNiSLzxZlbgADNc7m5rNdmkJrbpJZk_B6WZYVaYMakv0NeWcddelBvOgd_3HchHyFvOVpwz8-l6u9msBOPVSlS60rp8Rk65UmVhjJbPl1qqQhkNJ-RVSneMCWmYfElOIGtAS3VKHm4m3_s_Pvyg54h7usM6hqWrQ0cvMYwD0lvfIV2nNLa-nvwY6Pdp7jwm6sZIL_Y-Tfm4Lc6j_42Bfos4YRzoxsfpJ932dUre-fag9IGuXcxdKNYD_ivobZ4RXpMXru4TvjnuZ-Tm88X19kuxu7r8ul3vihaMmIoKeKOEA-iUQVeCcHk1FefL95XkDRdGgK7KpnaCI3doDHdtWYIGlK2CM_Lx4LuP468Z02QHn1rs-zrgOCcrOGMgSmUgox-eoHfjHEN-nRVQaaZAssWQH6g2jilFdHYf_VDHe8uZXVKyS0p2SckeU8qa90fnuRmwe1T8jyUD7w6AR8TH60oqCUzAXzAwliI</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2386053405</pqid></control><display><type>article</type><title>Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women</title><source>IEEE Electronic Library (IEL)</source><creator>Fergus, Paul ; Montanez, Casimiro Curbelo ; Abdulaimma, Basma ; Lisboa, Paulo ; Chalmers, Carl ; Pineles, Beth</creator><creatorcontrib>Fergus, Paul ; Montanez, Casimiro Curbelo ; Abdulaimma, Basma ; Lisboa, Paulo ; Chalmers, Carl ; Pineles, Beth</creatorcontrib><description><![CDATA[Genome-Wide Association Studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). 
GWAS typically use a p-value threshold of <inline-formula><tex-math notation="LaTeX">5*10^{-8}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq1-2868667.gif"/> </inline-formula> to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptible SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using lower p-values will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity is increased exponentially as a function of higher-order combinations making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is a fundamental component for robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births which has a strong genetic component with unexplained heritability reportedly between 20-40 percent. A GWAS data set, obtained from dbGap is utilised, which contains predominantly urban low-income African-American women who had normal and preterm deliveries. 
Epistatic interactions from original SNP sequences were extracted through a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm births observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasis can be extracted from 4,666 raw SNPs generated using logistic regression (p-value = <inline-formula><tex-math notation="LaTeX">5*10^{-3}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq2-2868667.gif"/> </inline-formula>) and used to fit a highly accurate classifier model. The following results (Sen = 0.9562, Spec = 0.8780, Gini = 0.9490, Logloss = 0.5901, AUC = 0.9745, and MSE = 0.2010) were obtained using 50 hidden nodes and (Sen = 0.9289, Spec = 0.9591, Gini = 0.9651, Logloss = 0.3080, AUC = 0.9825, and MSE = 0.0942) using 500 hidden nodes. 
The results were compared with a Support Vector Machine (SVM), a Random Forest (RF), and a Fishers Linear Discriminant Analysis classifier, which all failed to improve on the deep learning approach.]]></description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2018.2868667</identifier><identifier>PMID: 30183645</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>African Americans - genetics ; Algorithms ; Bioinformatics ; Classification ; Classifiers ; Computational Biology ; Computer applications ; Deep Learning ; Discriminant analysis ; Diseases ; Epistasis ; Epistasis, Genetic - genetics ; Female ; Gene mapping ; Genetic diversity ; Genetic variance ; Genome-Wide Association Study - methods ; Genomes ; Genomics ; Genotype &amp; phenotype ; GWAS ; Heritability ; Humans ; Infant, Newborn ; Machine learning ; Mapping ; Nodes ; Nucleotides ; Pediatrics ; Performance measurement ; Phenotypes ; Polymorphism, Single Nucleotide - genetics ; Pregnancy ; Premature birth ; Premature Birth - genetics ; Preterm birth ; Quality control ; Regression analysis ; Single-nucleotide polymorphism ; stacked autoencoders ; Statistical analysis ; Support vector machines ; Task complexity</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2020-03, Vol.17 (2), p.668-678</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c392t-831b52f33d59ef732ffffb8116866541b12923687baf21e1fe991fc77363e4c53</citedby><cites>FETCH-LOGICAL-c392t-831b52f33d59ef732ffffb8116866541b12923687baf21e1fe991fc77363e4c53</cites><orcidid>0000-0001-6365-4499 ; 0000-0001-5690-2474 ; 0000-0003-0822-1150 ; 0000-0002-7070-4447</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8454302$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27923,27924,54757</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8454302$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30183645$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Fergus, Paul</creatorcontrib><creatorcontrib>Montanez, Casimiro Curbelo</creatorcontrib><creatorcontrib>Abdulaimma, Basma</creatorcontrib><creatorcontrib>Lisboa, Paulo</creatorcontrib><creatorcontrib>Chalmers, Carl</creatorcontrib><creatorcontrib>Pineles, Beth</creatorcontrib><title>Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description><![CDATA[Genome-Wide Association Studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). 
GWAS typically use a p-value threshold of <inline-formula><tex-math notation="LaTeX">5*10^{-8}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq1-2868667.gif"/> </inline-formula> to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptible SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using lower p-values will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity is increased exponentially as a function of higher-order combinations making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is a fundamental component for robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births which has a strong genetic component with unexplained heritability reportedly between 20-40 percent. A GWAS data set, obtained from dbGap is utilised, which contains predominantly urban low-income African-American women who had normal and preterm deliveries. 
Epistatic interactions from original SNP sequences were extracted through a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm births observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasis can be extracted from 4,666 raw SNPs generated using logistic regression (p-value = <inline-formula><tex-math notation="LaTeX">5*10^{-3}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq2-2868667.gif"/> </inline-formula>) and used to fit a highly accurate classifier model. The following results (Sen = 0.9562, Spec = 0.8780, Gini = 0.9490, Logloss = 0.5901, AUC = 0.9745, and MSE = 0.2010) were obtained using 50 hidden nodes and (Sen = 0.9289, Spec = 0.9591, Gini = 0.9651, Logloss = 0.3080, AUC = 0.9825, and MSE = 0.0942) using 500 hidden nodes. 
The results were compared with a Support Vector Machine (SVM), a Random Forest (RF), and a Fishers Linear Discriminant Analysis classifier, which all failed to improve on the deep learning approach.]]></description><subject>African Americans - genetics</subject><subject>Algorithms</subject><subject>Bioinformatics</subject><subject>Classification</subject><subject>Classifiers</subject><subject>Computational Biology</subject><subject>Computer applications</subject><subject>Deep Learning</subject><subject>Discriminant analysis</subject><subject>Diseases</subject><subject>Epistasis</subject><subject>Epistasis, Genetic - genetics</subject><subject>Female</subject><subject>Gene mapping</subject><subject>Genetic diversity</subject><subject>Genetic variance</subject><subject>Genome-Wide Association Study - methods</subject><subject>Genomes</subject><subject>Genomics</subject><subject>Genotype &amp; phenotype</subject><subject>GWAS</subject><subject>Heritability</subject><subject>Humans</subject><subject>Infant, Newborn</subject><subject>Machine learning</subject><subject>Mapping</subject><subject>Nodes</subject><subject>Nucleotides</subject><subject>Pediatrics</subject><subject>Performance measurement</subject><subject>Phenotypes</subject><subject>Polymorphism, Single Nucleotide - genetics</subject><subject>Pregnancy</subject><subject>Premature birth</subject><subject>Premature Birth - genetics</subject><subject>Preterm birth</subject><subject>Quality control</subject><subject>Regression analysis</subject><subject>Single-nucleotide polymorphism</subject><subject>stacked autoencoders</subject><subject>Statistical analysis</subject><subject>Support vector machines</subject><subject>Task 
complexity</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><sourceid>EIF</sourceid><recordid>eNpdkd9uFCEUxonR2Fp9AGNiSLzxZlbgADNc7m5rNdmkJrbpJZk_B6WZYVaYMakv0NeWcddelBvOgd_3HchHyFvOVpwz8-l6u9msBOPVSlS60rp8Rk65UmVhjJbPl1qqQhkNJ-RVSneMCWmYfElOIGtAS3VKHm4m3_s_Pvyg54h7usM6hqWrQ0cvMYwD0lvfIV2nNLa-nvwY6Pdp7jwm6sZIL_Y-Tfm4Lc6j_42Bfos4YRzoxsfpJ932dUre-fag9IGuXcxdKNYD_ivobZ4RXpMXru4TvjnuZ-Tm88X19kuxu7r8ul3vihaMmIoKeKOEA-iUQVeCcHk1FefL95XkDRdGgK7KpnaCI3doDHdtWYIGlK2CM_Lx4LuP468Z02QHn1rs-zrgOCcrOGMgSmUgox-eoHfjHEN-nRVQaaZAssWQH6g2jilFdHYf_VDHe8uZXVKyS0p2SckeU8qa90fnuRmwe1T8jyUD7w6AR8TH60oqCUzAXzAwliI</recordid><startdate>202003</startdate><enddate>202003</enddate><creator>Fergus, Paul</creator><creator>Montanez, Casimiro Curbelo</creator><creator>Abdulaimma, Basma</creator><creator>Lisboa, Paulo</creator><creator>Chalmers, Carl</creator><creator>Pineles, Beth</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0001-6365-4499</orcidid><orcidid>https://orcid.org/0000-0001-5690-2474</orcidid><orcidid>https://orcid.org/0000-0003-0822-1150</orcidid><orcidid>https://orcid.org/0000-0002-7070-4447</orcidid></search><sort><creationdate>202003</creationdate><title>Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women</title><author>Fergus, Paul ; Montanez, Casimiro Curbelo ; Abdulaimma, Basma ; Lisboa, Paulo ; Chalmers, Carl ; Pineles, Beth</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c392t-831b52f33d59ef732ffffb8116866541b12923687baf21e1fe991fc77363e4c53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>African Americans - genetics</topic><topic>Algorithms</topic><topic>Bioinformatics</topic><topic>Classification</topic><topic>Classifiers</topic><topic>Computational Biology</topic><topic>Computer applications</topic><topic>Deep Learning</topic><topic>Discriminant analysis</topic><topic>Diseases</topic><topic>Epistasis</topic><topic>Epistasis, Genetic - genetics</topic><topic>Female</topic><topic>Gene mapping</topic><topic>Genetic diversity</topic><topic>Genetic variance</topic><topic>Genome-Wide Association Study - 
methods</topic><topic>Genomes</topic><topic>Genomics</topic><topic>Genotype &amp; phenotype</topic><topic>GWAS</topic><topic>Heritability</topic><topic>Humans</topic><topic>Infant, Newborn</topic><topic>Machine learning</topic><topic>Mapping</topic><topic>Nodes</topic><topic>Nucleotides</topic><topic>Pediatrics</topic><topic>Performance measurement</topic><topic>Phenotypes</topic><topic>Polymorphism, Single Nucleotide - genetics</topic><topic>Pregnancy</topic><topic>Premature birth</topic><topic>Premature Birth - genetics</topic><topic>Preterm birth</topic><topic>Quality control</topic><topic>Regression analysis</topic><topic>Single-nucleotide polymorphism</topic><topic>stacked autoencoders</topic><topic>Statistical analysis</topic><topic>Support vector machines</topic><topic>Task complexity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Fergus, Paul</creatorcontrib><creatorcontrib>Montanez, Casimiro Curbelo</creatorcontrib><creatorcontrib>Abdulaimma, Basma</creatorcontrib><creatorcontrib>Lisboa, Paulo</creatorcontrib><creatorcontrib>Chalmers, Carl</creatorcontrib><creatorcontrib>Pineles, Beth</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials 
Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Fergus, Paul</au><au>Montanez, Casimiro Curbelo</au><au>Abdulaimma, Basma</au><au>Lisboa, Paulo</au><au>Chalmers, Carl</au><au>Pineles, Beth</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2020-03</date><risdate>2020</risdate><volume>17</volume><issue>2</issue><spage>668</spage><epage>678</epage><pages>668-678</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract><![CDATA[Genome-Wide Association Studies (GWAS) are used to identify 
statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). GWAS typically use a p-value threshold of <inline-formula><tex-math notation="LaTeX">5*10^{-8}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>8</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq1-2868667.gif"/> </inline-formula> to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptible SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using lower p-values will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity is increased exponentially as a function of higher-order combinations making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is a fundamental component for robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births which has a strong genetic component with unexplained heritability reportedly between 20-40 percent. 
A GWAS data set, obtained from dbGap is utilised, which contains predominantly urban low-income African-American women who had normal and preterm deliveries. Epistatic interactions from original SNP sequences were extracted through a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm births observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasis can be extracted from 4,666 raw SNPs generated using logistic regression (p-value = <inline-formula><tex-math notation="LaTeX">5*10^{-3}</tex-math> <mml:math><mml:mrow><mml:mn>5</mml:mn><mml:mo>*</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math><inline-graphic xlink:href="fergus-ieq2-2868667.gif"/> </inline-formula>) and used to fit a highly accurate classifier model. The following results (Sen = 0.9562, Spec = 0.8780, Gini = 0.9490, Logloss = 0.5901, AUC = 0.9745, and MSE = 0.2010) were obtained using 50 hidden nodes and (Sen = 0.9289, Spec = 0.9591, Gini = 0.9651, Logloss = 0.3080, AUC = 0.9825, and MSE = 0.0942) using 500 hidden nodes. The results were compared with a Support Vector Machine (SVM), a Random Forest (RF), and a Fishers Linear Discriminant Analysis classifier, which all failed to improve on the deep learning approach.]]></abstract><cop>United States</cop><pub>IEEE</pub><pmid>30183645</pmid><doi>10.1109/TCBB.2018.2868667</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0001-6365-4499</orcidid><orcidid>https://orcid.org/0000-0001-5690-2474</orcidid><orcidid>https://orcid.org/0000-0003-0822-1150</orcidid><orcidid>https://orcid.org/0000-0002-7070-4447</orcidid><oa>free_for_read</oa></addata></record>
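The abstract in the record above reports both AUC and Gini for each network size. The following is a minimal Python sketch that checks whether each reported pair is internally consistent. It assumes Gini is the usual normalised coefficient, Gini = 2 * AUC - 1; the record does not state which definition the authors used, so this is an assumption. The function names and the rounding tolerance are illustrative and do not come from the source.

```python
# Reported values copied from the abstract in the record above.
# Keys are the number of hidden nodes.
REPORTED = {
    50: {"auc": 0.9745, "gini": 0.9490},
    500: {"auc": 0.9825, "gini": 0.9651},
}


def gini_from_auc(auc: float) -> float:
    """Normalised Gini coefficient under the assumed convention Gini = 2*AUC - 1."""
    return 2.0 * auc - 1.0


def is_consistent(auc: float, gini: float, tol: float = 1.5e-4) -> bool:
    """True if the reported Gini matches 2*AUC - 1 within rounding error.

    Both figures are given to four decimals, so the derived Gini can be off
    by up to about 1e-4 from the AUC rounding plus 5e-5 from the Gini rounding.
    """
    return abs(gini_from_auc(auc) - gini) <= tol


if __name__ == "__main__":
    for nodes, metrics in REPORTED.items():
        derived = gini_from_auc(metrics["auc"])
        ok = is_consistent(metrics["auc"], metrics["gini"])
        print(f"{nodes} hidden nodes: derived Gini = {derived:.4f}, "
              f"reported = {metrics['gini']:.4f}, consistent = {ok}")
```

Under that assumed convention, the Gini figure adds no information beyond the AUC, so a comparison against the SVM, Random Forest, and linear discriminant baselines could be made on AUC alone.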
fulltext fulltext_linktorsrc
identifier ISSN: 1545-5963
ispartof IEEE/ACM transactions on computational biology and bioinformatics, 2020-03, Vol.17 (2), p.668-678
issn 1545-5963
1557-9964
language eng
recordid cdi_proquest_journals_2386053405
source IEEE Electronic Library (IEL)
subjects African Americans - genetics
Algorithms
Bioinformatics
Classification
Classifiers
Computational Biology
Computer applications
Deep Learning
Discriminant analysis
Diseases
Epistasis
Epistasis, Genetic - genetics
Female
Gene mapping
Genetic diversity
Genetic variance
Genome-Wide Association Study - methods
Genomes
Genomics
Genotype & phenotype
GWAS
Heritability
Humans
Infant, Newborn
Machine learning
Mapping
Nodes
Nucleotides
Pediatrics
Performance measurement
Phenotypes
Polymorphism, Single Nucleotide - genetics
Pregnancy
Premature birth
Premature Birth - genetics
Preterm birth
Quality control
Regression analysis
Single-nucleotide polymorphism
stacked autoencoders
Statistical analysis
Support vector machines
Task complexity
title Utilizing Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T07%3A15%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Utilizing%20Deep%20Learning%20and%20Genome%20Wide%20Association%20Studies%20for%20Epistatic-Driven%20Preterm%20Birth%20Classification%20in%20African-American%20Women&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Fergus,%20Paul&rft.date=2020-03&rft.volume=17&rft.issue=2&rft.spage=668&rft.epage=678&rft.pages=668-678&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2018.2868667&rft_dat=%3Cproquest_RIE%3E2100327593%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2386053405&rft_id=info:pmid/30183645&rft_ieee_id=8454302&rfr_iscdi=true