Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis

Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. Whe...

Bibliographic details
Published in:	Nucleic acids research 1987-03, Vol.15 (6), p.2611-2626
Main authors:	PHILLIPS, G. J; ARNOLD, J; IVARIE, R
Format:	Article
Language:	eng
Subjects:	Bacteriology; Base Composition; Base Sequence; Biological and medical sciences; Escherichia coli; Escherichia coli - genetics; Fundamental and applied biological sciences. Psychology; Genes, Bacterial; Genetic Complementation Test; Genetics; Methods; Microbiology; Oligodeoxyribonucleotides; Operon
Online access:	Full text

container_end_page	2626
container_issue	6
container_start_page	2611
container_title	Nucleic acids research
container_volume	15
creator	PHILLIPS, G. J; ARNOLD, J; IVARIE, R
description	Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked from highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. All ordered abundance plots had a wide linear range containing the majority of the oligomers, and deviated sharply from linearity at the high and low ends of the curves. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any oligonucleotide longer than four bases in the E. coli genome (or any genome) can be estimated relatively accurately from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03 ± 0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. coli genome of 4.2 × 10^6 bp many relatively short sequences of 7-9 bp are very rare or absent.
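The abstract does not reproduce the equation itself, so the following is only a minimal sketch of the standard 3rd order Markov chain estimator, assuming that is what "serial application" refers to: the expected frequency of a word is the product of its overlapping tetranucleotide frequencies divided by the product of its internal trinucleotide frequencies. The function names `kmer_frequencies` and `markov3_expected` and the toy sequence are hypothetical and are not taken from the article.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer observed in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def markov3_expected(word, f3, f4):
    """Expected frequency of `word` (length >= 4) under a 3rd order Markov chain.

    f3 and f4 map tri- and tetranucleotides to observed relative frequencies.
    For a hexamer w1..w6 this evaluates
        f(w1-4) * f(w2-5) * f(w3-6) / (f(w2-4) * f(w3-5)).
    """
    if len(word) < 4:
        raise ValueError("word must be at least four bases long")
    numerator = 1.0
    for i in range(len(word) - 3):          # all overlapping tetranucleotides
        numerator *= f4.get(word[i:i + 4], 0.0)
    denominator = 1.0
    for i in range(1, len(word) - 3):       # internal trinucleotides only
        denominator *= f3.get(word[i:i + 3], 0.0)
    # An unseen component makes the estimate 0 rather than dividing by zero.
    return numerator / denominator if denominator > 0 else 0.0


# Toy input (an assumption for illustration, not E. coli data).
toy = "ACGT" * 50
f3 = kmer_frequencies(toy, 3)
f4 = kmer_frequencies(toy, 4)
f6 = kmer_frequencies(toy, 6)
expected = markov3_expected("ACGTAC", f3, f4)
observed = f6["ACGTAC"]
```

On real data, the ratio `observed / expected` for each hexanucleotide is the quantity the abstract summarizes as 1.03 ± 0.94. Under the same assumptions, the reciprocal of an expected frequency gives a rough mean spacing, in nucleotides, between occurrences of that site.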
doi_str_mv	10.1093/nar/15.6.2611
format	Article
fullrecord	<record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_340672</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>77469604</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4531-4f2b43eda9a54ed04c1989c41350c358d630467aede844612eb4ed2456df51c23</originalsourceid><addsrcrecordid>eNqFkUFv1DAQha0KVJbSI0ckH1Bv2drx2EmQOKCqFKRWXOBszTqTjSGxFzup6L9vVl2tyqmnObzvPc3MY-y9FGspGnUZMF1KvTbr0kh5wlZSmbKAxpSv2EoooQspoH7D3ub8WwgJUsMpO1VaC9M0K2bvYojF1Kc4b3ve0z8MsxsoTr4l7uK4i9lPPgYeOz71xK-z6yl513tc5MHzLYU40ieO_A7Tn3jPXY8-cAw4PGSf37HXHQ6Zzg_zjP36ev3z6ltx--Pm-9WX28KBVrKArtyAohYb1ECtACebunEglRZO6bo1SoCpkFqqAYwsabNgJWjTdlq6Up2xz0-5u3kzUusoTAkHu0t-xPRgI3r7vxJ8b7fx3ioQptr7Lw7-FP_OlCc7-uxoGDBQnLOtKjCNEfAiKJfPy6qqXwahrvSSuYDFE-hSzDlRd9xaCruv2C4VW6mtsfuKF_7D81OP9KHTRf940DE7HLqEwfl8xCrQWpWVegSBT7A9</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>14875696</pqid></control><display><type>article</type><title>Mono-through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis</title><source>MEDLINE</source><source>Oxford University Press Journals Digital Archive Legacy</source><source>PubMed Central</source><creator>PHILLIPS, G. J ; ARNOLD, J ; IVARIE, R</creator><creatorcontrib>PHILLIPS, G. J ; ARNOLD, J ; IVARIE, R</creatorcontrib><description>Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. 
All ordered abundance plots had a wide linear range containing the majority of the oligomers which deviated sharply at the high and low ends of the curves. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any nucleotide longer than four bases in the E. coli genome (or any genome) can be relatively accurately estimated from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03 +/- 0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. coli genome of 4.2 X 10(6) bp many relatively short sequences of 7-9 bp are very rare or absent.</description><identifier>ISSN: 0305-1048</identifier><identifier>EISSN: 1362-4962</identifier><identifier>DOI: 10.1093/nar/15.6.2611</identifier><identifier>PMID: 3550699</identifier><identifier>CODEN: NARHAD</identifier><language>eng</language><publisher>Oxford: Oxford University Press</publisher><subject>Bacteriology ; Base Composition ; Base Sequence ; Biological and medical sciences ; Escherichia coli ; Escherichia coli - genetics ; Fundamental and applied biological sciences. 
Psychology ; Genes, Bacterial ; Genetic Complementation Test ; Genetics ; Methods ; Microbiology ; Oligodeoxyribonucleotides ; Operon</subject><ispartof>Nucleic acids research, 1987-03, Vol.15 (6), p.2611-2626</ispartof><rights>1988 INIST-CNRS</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4531-4f2b43eda9a54ed04c1989c41350c358d630467aede844612eb4ed2456df51c23</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC340672/pdf/$$EPDF$$P50$$Gpubmedcentral$$H</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC340672/$$EHTML$$P50$$Gpubmedcentral$$H</linktohtml><link.rule.ids>230,314,727,780,784,885,27924,27925,53791,53793</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=7455327$$DView record in Pascal Francis$$Hfree_for_read</backlink><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/3550699$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>PHILLIPS, G. J</creatorcontrib><creatorcontrib>ARNOLD, J</creatorcontrib><creatorcontrib>IVARIE, R</creatorcontrib><title>Mono-through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis</title><title>Nucleic acids research</title><addtitle>Nucleic Acids Res</addtitle><description>Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. 
All ordered abundance plots had a wide linear range containing the majority of the oligomers which deviated sharply at the high and low ends of the curves. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any nucleotide longer than four bases in the E. coli genome (or any genome) can be relatively accurately estimated from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03 +/- 0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. coli genome of 4.2 X 10(6) bp many relatively short sequences of 7-9 bp are very rare or absent.</description><subject>Bacteriology</subject><subject>Base Composition</subject><subject>Base Sequence</subject><subject>Biological and medical sciences</subject><subject>Escherichia coli</subject><subject>Escherichia coli - genetics</subject><subject>Fundamental and applied biological sciences. 
Psychology</subject><subject>Genes, Bacterial</subject><subject>Genetic Complementation Test</subject><subject>Genetics</subject><subject>Methods</subject><subject>Microbiology</subject><subject>Oligodeoxyribonucleotides</subject><subject>Operon</subject><issn>0305-1048</issn><issn>1362-4962</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>1987</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNqFkUFv1DAQha0KVJbSI0ckH1Bv2drx2EmQOKCqFKRWXOBszTqTjSGxFzup6L9vVl2tyqmnObzvPc3MY-y9FGspGnUZMF1KvTbr0kh5wlZSmbKAxpSv2EoooQspoH7D3ub8WwgJUsMpO1VaC9M0K2bvYojF1Kc4b3ve0z8MsxsoTr4l7uK4i9lPPgYeOz71xK-z6yl513tc5MHzLYU40ieO_A7Tn3jPXY8-cAw4PGSf37HXHQ6Zzg_zjP36ev3z6ltx--Pm-9WX28KBVrKArtyAohYb1ECtACebunEglRZO6bo1SoCpkFqqAYwsabNgJWjTdlq6Up2xz0-5u3kzUusoTAkHu0t-xPRgI3r7vxJ8b7fx3ioQptr7Lw7-FP_OlCc7-uxoGDBQnLOtKjCNEfAiKJfPy6qqXwahrvSSuYDFE-hSzDlRd9xaCruv2C4VW6mtsfuKF_7D81OP9KHTRf940DE7HLqEwfl8xCrQWpWVegSBT7A9</recordid><startdate>19870325</startdate><enddate>19870325</enddate><creator>PHILLIPS, G. J</creator><creator>ARNOLD, J</creator><creator>IVARIE, R</creator><general>Oxford University Press</general><scope>IQODW</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QL</scope><scope>7TM</scope><scope>8FD</scope><scope>C1K</scope><scope>FR3</scope><scope>P64</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>19870325</creationdate><title>Mono-through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis</title><author>PHILLIPS, G. 
J ; ARNOLD, J ; IVARIE, R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4531-4f2b43eda9a54ed04c1989c41350c358d630467aede844612eb4ed2456df51c23</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>1987</creationdate><topic>Bacteriology</topic><topic>Base Composition</topic><topic>Base Sequence</topic><topic>Biological and medical sciences</topic><topic>Escherichia coli</topic><topic>Escherichia coli - genetics</topic><topic>Fundamental and applied biological sciences. Psychology</topic><topic>Genes, Bacterial</topic><topic>Genetic Complementation Test</topic><topic>Genetics</topic><topic>Methods</topic><topic>Microbiology</topic><topic>Oligodeoxyribonucleotides</topic><topic>Operon</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>PHILLIPS, G. J</creatorcontrib><creatorcontrib>ARNOLD, J</creatorcontrib><creatorcontrib>IVARIE, R</creatorcontrib><collection>Pascal-Francis</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Nucleic Acids Abstracts</collection><collection>Technology Research Database</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Nucleic acids research</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>PHILLIPS, G. 
J</au><au>ARNOLD, J</au><au>IVARIE, R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Mono-through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis</atitle><jtitle>Nucleic acids research</jtitle><addtitle>Nucleic Acids Res</addtitle><date>1987-03-25</date><risdate>1987</risdate><volume>15</volume><issue>6</issue><spage>2611</spage><epage>2626</epage><pages>2611-2626</pages><issn>0305-1048</issn><eissn>1362-4962</eissn><coden>NARHAD</coden><abstract>Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. All ordered abundance plots had a wide linear range containing the majority of the oligomers which deviated sharply at the high and low ends of the curves. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any nucleotide longer than four bases in the E. coli genome (or any genome) can be relatively accurately estimated from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03 +/- 0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. 
coli genome of 4.2 X 10(6) bp many relatively short sequences of 7-9 bp are very rare or absent.</abstract><cop>Oxford</cop><pub>Oxford University Press</pub><pmid>3550699</pmid><doi>10.1093/nar/15.6.2611</doi><tpages>16</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0305-1048
ispartof	Nucleic acids research, 1987-03, Vol.15 (6), p.2611-2626
issn	0305-1048 1362-4962
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_340672
source	MEDLINE; Oxford University Press Journals Digital Archive Legacy; PubMed Central
subjects	Bacteriology; Base Composition; Base Sequence; Biological and medical sciences; Escherichia coli; Escherichia coli - genetics; Fundamental and applied biological sciences. Psychology; Genes, Bacterial; Genetic Complementation Test; Genetics; Methods; Microbiology; Oligodeoxyribonucleotides; Operon
title	Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T04%3A54%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Mono-through%20hexanucleotide%20composition%20of%20the%20Escherichia%20coli%20genome:%20a%20Markov%20chain%20analysis&rft.jtitle=Nucleic%20acids%20research&rft.au=PHILLIPS,%20G.%20J&rft.date=1987-03-25&rft.volume=15&rft.issue=6&rft.spage=2611&rft.epage=2626&rft.pages=2611-2626&rft.issn=0305-1048&rft.eissn=1362-4962&rft.coden=NARHAD&rft_id=info:doi/10.1093/nar/15.6.2611&rft_dat=%3Cproquest_pubme%3E77469604%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=14875696&rft_id=info:pmid/3550699&rfr_iscdi=true