Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison

The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formul...
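
As a rough illustration of the CV method summarized above, the following Python sketch builds a composition vector from k-mer frequencies. It uses the Markov-type background estimate commonly associated with Hao's formula, p0(a1...ak) = p(a1...a(k-1)) p(a2...ak) / p(a2...a(k-1)). It is not the entropy-maximizing formula derived in the paper, whose closed form does not appear in this record. The function names, the default DNA alphabet, the treatment of zero denominators, and the cosine-based distance are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def composition_vector(seq, k, alphabet="ACGT"):
    """Composition vector using a Hao-style background estimate (k >= 3).

    Each component is (p - p0) / p0, where p is the observed k-mer
    frequency and p0 is predicted from the (k-1)- and (k-2)-mer
    frequencies. Components with p0 == 0 are set to 0 (an assumption).
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vector = []
    for word in map("".join, product(alphabet, repeat=k)):
        middle = p_k2.get(word[1:-1], 0.0)
        if middle > 0:
            p0 = p_k1.get(word[:-1], 0.0) * p_k1.get(word[1:], 0.0) / middle
        else:
            p0 = 0.0
        vector.append((p_k.get(word, 0.0) - p0) / p0 if p0 > 0 else 0.0)
    return vector


def cv_distance(u, v):
    """Distance (1 - cos(u, v)) / 2 between two composition vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # assumption: treat an all-zero vector as uncorrelated
    return (1 - dot / (norm_u * norm_v)) / 2


if __name__ == "__main__":
    a = composition_vector("ACGTACGGTCAATGCCGTA", 3)
    b = composition_vector("TTGACCGATAGGCTAACGT", 3)
    print(cv_distance(a, b))
```

A pairwise distance matrix computed this way could then be passed to a tree-building method such as neighbor joining; that step is outside this sketch.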

Bibliographic details
Published in:	IEEE/ACM transactions on computational biology and bioinformatics 2012-01, Vol.9 (1), p.79-87
Main authors:	Chan, R. H.; Chan, T. H.; Hau Man Yeung; Wang, R. W.
Format:	Article
Language:	eng
Subjects:	alignment-free sequence comparison; Animals; Bacteria - classification; Bacteria - genetics; Bioinformatics; Composition vector method; Computational Biology - methods; Computational modeling; Computer Simulation; Databases, Genetic; Entropy; Estimation; Humans; Markov Chains; maximum entropy principle; Models, Genetic; Optimization; optimization model; phylogenetics; Phylogeny; Sequence Analysis, DNA - methods; Strain; Studies

container_end_page	87
container_issue	1
container_start_page	79
container_title	IEEE/ACM transactions on computational biology and bioinformatics
container_volume	9
creator	Chan, R. H.; Chan, T. H.; Hau Man Yeung; Wang, R. W.
description	The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.
doi_str_mv	10.1109/TCBB.2011.45
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_miscellaneous_920807200</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5728790</ieee_id><sourcerecordid>2525162811</sourcerecordid><originalsourceid>FETCH-LOGICAL-c370t-47d6ec1b284b49f71d24a5aa6da5533fd6a8cc7231ffbb41526bcddafbbe58c63</originalsourceid><addsrcrecordid>eNp90c1LHDEYBvBQKtVqb70VyuClHpxtvjM5dhdrBUWhW68hk7xDIzuTaTID-t83w6qHHnrK14-HNzwIfSR4RQjWX7eb9XpFMSErLt6gIyKEqrWW_O2y56IWWrJD9D7nB4wp15i_Q4eUsIZxIo_QdhP7MeYwhThU9-CmmKobmH5HX61tBl-V6xv7GPq5ry6GKcXxqbpLYXBh3EHVFf0T_swwOKiWJJtCjsMJOujsLsOH5_UY_fp-sd38qK9vL682365rxxSeaq68BEda2vCW604RT7kV1kpvhWCs89I2zinKSNe1LSeCytZ5b8sBROMkO0Zf9rljimWIPJk-ZAe7nR0gztloihusKMZFnv1XEs6IJJooVejpP_Qhzmko_zC6hImGCl7Q-R65FHNO0Jkxhd6mJ0OwWWoxSy1mqcVwUfjn58y57cG_4pceCvi0BwEAXp-Foo3SmP0FebSQmg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>907258254</pqid></control><display><type>article</type><title>Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison</title><source>IEEE Electronic Library (IEL)</source><creator>Chan, R. H. ; Chan, T. H. ; Hau Man Yeung ; Wang, R. W.</creator><creatorcontrib>Chan, R. H. ; Chan, T. H. ; Hau Man Yeung ; Wang, R. W.</creatorcontrib><description>The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. 
We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2011.45</identifier><identifier>PMID: 21383416</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>alignment-free sequence comparison ; Animals ; Bacteria - classification ; Bacteria - genetics ; Bioinformatics ; Composition vector method ; Computational Biology - methods ; Computational modeling ; Computer Simulation ; Databases, Genetic ; Entropy ; Estimation ; Humans ; Markov Chains ; maximum entropy principle ; Models, Genetic ; Optimization ; optimization model ; phylogenetics ; Phylogeny ; Sequence Analysis, DNA - methods ; Strain ; Studies</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2012-01, Vol.9 (1), p.79-87</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) Jan/Feb 2012</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c370t-47d6ec1b284b49f71d24a5aa6da5533fd6a8cc7231ffbb41526bcddafbbe58c63</citedby><cites>FETCH-LOGICAL-c370t-47d6ec1b284b49f71d24a5aa6da5533fd6a8cc7231ffbb41526bcddafbbe58c63</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5728790$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,777,781,793,27905,27906,54739</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5728790$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/21383416$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Chan, R. H.</creatorcontrib><creatorcontrib>Chan, T. H.</creatorcontrib><creatorcontrib>Hau Man Yeung</creatorcontrib><creatorcontrib>Wang, R. W.</creatorcontrib><title>Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. 
We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.</description><subject>alignment-free sequence comparison</subject><subject>Animals</subject><subject>Bacteria - classification</subject><subject>Bacteria - genetics</subject><subject>Bioinformatics</subject><subject>Composition vector method</subject><subject>Computational Biology - methods</subject><subject>Computational modeling</subject><subject>Computer Simulation</subject><subject>Databases, Genetic</subject><subject>Entropy</subject><subject>Estimation</subject><subject>Humans</subject><subject>Markov Chains</subject><subject>maximum entropy principle</subject><subject>Models, Genetic</subject><subject>Optimization</subject><subject>optimization model</subject><subject>phylogenetics</subject><subject>Phylogeny</subject><subject>Sequence Analysis, DNA - 
methods</subject><subject>Strain</subject><subject>Studies</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2012</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><sourceid>EIF</sourceid><recordid>eNp90c1LHDEYBvBQKtVqb70VyuClHpxtvjM5dhdrBUWhW68hk7xDIzuTaTID-t83w6qHHnrK14-HNzwIfSR4RQjWX7eb9XpFMSErLt6gIyKEqrWW_O2y56IWWrJD9D7nB4wp15i_Q4eUsIZxIo_QdhP7MeYwhThU9-CmmKobmH5HX61tBl-V6xv7GPq5ry6GKcXxqbpLYXBh3EHVFf0T_swwOKiWJJtCjsMJOujsLsOH5_UY_fp-sd38qK9vL682365rxxSeaq68BEda2vCW604RT7kV1kpvhWCs89I2zinKSNe1LSeCytZ5b8sBROMkO0Zf9rljimWIPJk-ZAe7nR0gztloihusKMZFnv1XEs6IJJooVejpP_Qhzmko_zC6hImGCl7Q-R65FHNO0Jkxhd6mJ0OwWWoxSy1mqcVwUfjn58y57cG_4pceCvi0BwEAXp-Foo3SmP0FebSQmg</recordid><startdate>201201</startdate><enddate>201201</enddate><creator>Chan, R. H.</creator><creator>Chan, T. H.</creator><creator>Hau Man Yeung</creator><creator>Wang, R. W.</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope></search><sort><creationdate>201201</creationdate><title>Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison</title><author>Chan, R. H. ; Chan, T. H. ; Hau Man Yeung ; Wang, R. 
W.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c370t-47d6ec1b284b49f71d24a5aa6da5533fd6a8cc7231ffbb41526bcddafbbe58c63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2012</creationdate><topic>alignment-free sequence comparison</topic><topic>Animals</topic><topic>Bacteria - classification</topic><topic>Bacteria - genetics</topic><topic>Bioinformatics</topic><topic>Composition vector method</topic><topic>Computational Biology - methods</topic><topic>Computational modeling</topic><topic>Computer Simulation</topic><topic>Databases, Genetic</topic><topic>Entropy</topic><topic>Estimation</topic><topic>Humans</topic><topic>Markov Chains</topic><topic>maximum entropy principle</topic><topic>Models, Genetic</topic><topic>Optimization</topic><topic>optimization model</topic><topic>phylogenetics</topic><topic>Phylogeny</topic><topic>Sequence Analysis, DNA - methods</topic><topic>Strain</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Chan, R. H.</creatorcontrib><creatorcontrib>Chan, T. H.</creatorcontrib><creatorcontrib>Hau Man Yeung</creatorcontrib><creatorcontrib>Wang, R. 
W.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Chan, R. H.</au><au>Chan, T. H.</au><au>Hau Man Yeung</au><au>Wang, R. W.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2012-01</date><risdate>2012</risdate><volume>9</volume><issue>1</issue><spage>79</spage><epage>87</epage><pages>79-87</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. 
For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>21383416</pmid><doi>10.1109/TCBB.2011.45</doi><tpages>9</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1545-5963
ispartof	IEEE/ACM transactions on computational biology and bioinformatics, 2012-01, Vol.9 (1), p.79-87
issn	1545-5963 1557-9964
language	eng
recordid	cdi_proquest_miscellaneous_920807200
source	IEEE Electronic Library (IEL)
subjects	alignment-free sequence comparison; Animals; Bacteria - classification; Bacteria - genetics; Bioinformatics; Composition vector method; Computational Biology - methods; Computational modeling; Computer Simulation; Databases, Genetic; Entropy; Estimation; Humans; Markov Chains; maximum entropy principle; Models, Genetic; Optimization; optimization model; phylogenetics; Phylogeny; Sequence Analysis, DNA - methods; Strain; Studies
title	Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T10%3A26%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Composition%20Vector%20Method%20Based%20on%20Maximum%20Entropy%20Principle%20for%20Sequence%20Comparison&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Chan,%20R.%20H.&rft.date=2012-01&rft.volume=9&rft.issue=1&rft.spage=79&rft.epage=87&rft.pages=79-87&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2011.45&rft_dat=%3Cproquest_RIE%3E2525162811%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=907258254&rft_id=info:pmid/21383416&rft_ieee_id=5728790&rfr_iscdi=true