Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are best, how the models should be built, and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein bindin...
Published in: | BMC bioinformatics 2013-08, Vol.14 Suppl 10 (S10), p.S2-S2, Article S2 |
---|---|
Main authors: | Kähärä, Juhani ; Lähdesmäki, Harri |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | S2 |
---|---|
container_issue | S10 |
container_start_page | S2 |
container_title | BMC bioinformatics |
container_volume | 14 Suppl 10 |
creator | Kähärä, Juhani Lähdesmäki, Harri |
description | Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are best, how the models should be built, and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented a standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1, the performance of the model was no better than chance. |
doi_str_mv | 10.1186/1471-2105-14-S10-S2 |
format | Article |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/24267147 (View this record in MEDLINE/PubMed) |
date | 2013-08-12 |
linktohtml | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750486/ |
linktopdf | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750486/pdf/ |
oa | free_for_read |
pmid | 24267147 |
publisher | England: BioMed Central |
rights | 2013 Kähärä and Lähdesmäki; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
toplevel | peer_reviewed; online_resources |
fulltext | fulltext |
identifier | ISSN: 1471-2105 |
ispartof | BMC bioinformatics, 2013-08, Vol.14 Suppl 10 (S10), p.S2-S2, Article S2 |
issn | 1471-2105 1471-2105 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3750486 |
source | MEDLINE; Springer Nature - Complete Springer Journals; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; PubMed Central Open Access; Springer Nature OA Free Journals |
subjects | Algorithms DNA - genetics DNA - metabolism DNA-Binding Proteins - genetics DNA-Binding Proteins - metabolism GATA1 Transcription Factor - genetics GATA1 Transcription Factor - metabolism High-Throughput Nucleotide Sequencing Humans Linear Models NFATC Transcription Factors - genetics NFATC Transcription Factors - metabolism Oligonucleotide Array Sequence Analysis Protein Binding - genetics Protein Interaction Mapping - methods Proteins - genetics Proteins - metabolism Regulatory Factor X Transcription Factors Transcription Factors - genetics Transcription Factors - metabolism |
title | Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-10T22%3A50%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluating%20a%20linear%20k-mer%20model%20for%20protein-DNA%20interactions%20using%20high-throughput%20SELEX%20data&rft.jtitle=BMC%20bioinformatics&rft.au=K%C3%A4h%C3%A4r%C3%A4,%20Juhani&rft.date=2013-08-12&rft.volume=14%20Suppl%2010&rft.issue=S10&rft.spage=S2&rft.epage=S2&rft.pages=S2-S2&rft.artnum=S2&rft.issn=1471-2105&rft.eissn=1471-2105&rft_id=info:doi/10.1186/1471-2105-14-S10-S2&rft_dat=%3Cproquest_pubme%3E3047764671%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1425540063&rft_id=info:pmid/24267147&rfr_iscdi=true |
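
The description field above summarizes a linear k-mer model whose k-mer set is pruned with little loss of accuracy. The record does not give the authors' fitting procedure, so the following is only a minimal sketch of the general idea under stated assumptions: overlapping k-mer counts as features, a simple perceptron as a stand-in training rule, and pruning by absolute weight instead of the cross-validation scheme mentioned in the abstract. All function names (`kmer_counts`, `linear_score`, `train_perceptron`, `top_kmers`) and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_score(seq, weights, k, bias=0.0):
    """Linear k-mer model: bias plus the sum of weight * count over k-mers."""
    counts = kmer_counts(seq, k)
    return bias + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())


def train_perceptron(seqs, labels, k, epochs=20, lr=0.1):
    """Fit k-mer weights with a plain perceptron (an assumed stand-in,
    not the fitting method of the paper). Labels are +1 (bound) or -1."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for seq, y in zip(seqs, labels):
            # Update only on misclassified (or zero-margin) sequences.
            if y * linear_score(seq, weights, k, bias) <= 0:
                for kmer, n in kmer_counts(seq, k).items():
                    weights[kmer] = weights.get(kmer, 0.0) + lr * y * n
                bias += lr * y
    return weights, bias


def top_kmers(weights, n):
    """Keep the n k-mers with the largest absolute weight (a simple
    pruning heuristic, not the cross-validation scheme of the abstract)."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:n])


# Toy data: two "bound" sequences sharing a GATA-like core, two "unbound".
seqs = ["AGATAA", "TGATAG", "CCCCCC", "ACGTAC"]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(seqs, labels, k=3)
scores = [linear_score(s, weights, 3, bias) for s in seqs]
```

On this toy set the fitted scores are positive for the two bound sequences and negative for the two unbound ones. In a realistic setting the pruned model from `top_kmers` would be re-evaluated on held-out sequences to check how much accuracy is lost.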