Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences

The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA,...

Bibliographic details
Published in:	IEEE/ACM transactions on computational biology and bioinformatics 2020-09, Vol.17 (5), p.1648-1659
Main authors:	Ranjan, Ashish; Fahad, Md Shah; Fernandez-Baca, David; Deepak, Akshay; Tripathi, Sudhakar
Format:	Article
Language:	English
Subjects:	Accuracy; Amino acid sequence; Amino acids; Bi-directional long short-term memory (Bi-LSTM); Bidirectional control; Biological activity; Biological processes; Biological system modeling; Discriminant analysis; Learning algorithms; long protein sequence; Long short-term memory; Machine learning; Model accuracy; multi-label linear discriminant analysis (MLDA); Organisms; protein segment vector; Protein sequence; Protein structure; Proteins; Segments; Sequences

container_end_page	1659
container_issue	5
container_start_page	1648
container_title	IEEE/ACM transactions on computational biology and bioinformatics
container_volume	17
creator	Ranjan, Ashish; Fahad, Md Shah; Fernandez-Baca, David; Deepak, Akshay; Tripathi, Sudhakar
description	The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing its functions, which motivates analyzing these sequences to predict function. Although machine-learning-based approaches are fast compared with methods using BLAST, FASTA, etc., they perform poorly on long protein sequences (more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed multi-sized-segment feature set is combined with the model trained on state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve accuracy. Extensive evaluations on separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences but also a significant improvement in overall accuracy over the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA-based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. Combining the two models raises the improvement further, to +7.41 and +9.21 percent, respectively.
doi_str_mv	10.1109/TCBB.2019.2911609
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_pubmed_primary_30998479</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8692646</ieee_id><sourcerecordid>2211951247</sourcerecordid><originalsourceid>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</originalsourceid><addsrcrecordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2449311013</pqid></control><display><type>article</type><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><source>IEEE Electronic Library (IEL)</source><creator>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creator><creatorcontrib>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creatorcontrib><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2019.2911609</identifier><identifier>PMID: 30998479</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Accuracy ; Amino acid sequence ; Amino acids ; Bi-directional long short-term memory (Bi-LSTM) ; Bidirectional control ; Biological activity ; Biological processes ; Biological system modeling ; Discriminant analysis ; Learning algorithms ; long protein sequence ; Long short-term memory ; Machine learning ; Model accuracy ; multi-label linear discriminant analysis (MLDA) ; Organisms ; protein segment vector ; Protein sequence ; Protein structure ; Proteins ; Segments ; Sequences</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659</ispartof><rights>Copyright The Institute of 
Electrical and Electronics Engineers, Inc. (IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</citedby><cites>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</cites><orcidid>0000-0002-0091-1088 ; 0000-0001-6854-8599</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27903,27904,54737</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30998479$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><subject>Accuracy</subject><subject>Amino acid sequence</subject><subject>Amino acids</subject><subject>Bi-directional long short-term memory (Bi-LSTM)</subject><subject>Bidirectional control</subject><subject>Biological activity</subject><subject>Biological processes</subject><subject>Biological system modeling</subject><subject>Discriminant analysis</subject><subject>Learning algorithms</subject><subject>long protein sequence</subject><subject>Long short-term memory</subject><subject>Machine learning</subject><subject>Model accuracy</subject><subject>multi-label linear discriminant analysis (MLDA)</subject><subject>Organisms</subject><subject>protein segment vector</subject><subject>Protein sequence</subject><subject>Protein 
structure</subject><subject>Proteins</subject><subject>Segments</subject><subject>Sequences</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</recordid><startdate>202009</startdate><enddate>202009</enddate><creator>Ranjan, Ashish</creator><creator>Fahad, Md Shah</creator><creator>Fernandez-Baca, David</creator><creator>Deepak, Akshay</creator><creator>Tripathi, Sudhakar</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></search><sort><creationdate>202009</creationdate><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><author>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Amino acid sequence</topic><topic>Amino acids</topic><topic>Bi-directional long short-term memory (Bi-LSTM)</topic><topic>Bidirectional control</topic><topic>Biological activity</topic><topic>Biological processes</topic><topic>Biological system modeling</topic><topic>Discriminant analysis</topic><topic>Learning algorithms</topic><topic>long protein sequence</topic><topic>Long short-term memory</topic><topic>Machine learning</topic><topic>Model accuracy</topic><topic>multi-label linear discriminant analysis (MLDA)</topic><topic>Organisms</topic><topic>protein segment vector</topic><topic>Protein sequence</topic><topic>Protein 
structure</topic><topic>Proteins</topic><topic>Segments</topic><topic>Sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - 
Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ranjan, Ashish</au><au>Fahad, Md Shah</au><au>Fernandez-Baca, David</au><au>Deepak, Akshay</au><au>Tripathi, Sudhakar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2020-09</date><risdate>2020</risdate><volume>17</volume><issue>5</issue><spage>1648</spage><epage>1659</epage><pages>1648-1659</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. 
Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>30998479</pmid><doi>10.1109/TCBB.2019.2911609</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1545-5963
ispartof	IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659
issn	1545-5963; 1557-9964
language	eng
recordid	cdi_pubmed_primary_30998479
source	IEEE Electronic Library (IEL)
subjects	Accuracy; Amino acid sequence; Amino acids; Bi-directional long short-term memory (Bi-LSTM); Bidirectional control; Biological activity; Biological processes; Biological system modeling; Discriminant analysis; Learning algorithms; long protein sequence; Long short-term memory; Machine learning; Model accuracy; multi-label linear discriminant analysis (MLDA); Organisms; protein segment vector; Protein sequence; Protein structure; Proteins; Segments; Sequences
title	Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T15%3A03%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Robust%20Framework%20for%20Protein%20Function%20Prediction%20Using%20Variable-Length%20Protein%20Sequences&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Ranjan,%20Ashish&rft.date=2020-09&rft.volume=17&rft.issue=5&rft.spage=1648&rft.epage=1659&rft.pages=1648-1659&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2019.2911609&rft_dat=%3Cproquest_RIE%3E2211951247%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2449311013&rft_id=info:pmid/30998479&rft_ieee_id=8692646&rfr_iscdi=true
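
The description above mentions analyzing a protein sequence as fixed single-sized segments and as multi-sized segments before features are built with a Bi-LSTM. The following is only a rough illustration of what such segmentation could look like, not the authors' procedure: the record does not state the segment sizes, whether segments overlap, or how the final partial segment is handled. The function names `split_fixed` and `split_multi`, the sizes 4 and 8, the non-overlapping windows, and the example sequence are all assumptions made for this sketch.

```python
from typing import Dict, List


def split_fixed(sequence: str, size: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping segments.

    Assumption: windows do not overlap, and the last segment is kept
    even if it is shorter than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def split_multi(sequence: str, sizes: List[int]) -> Dict[int, List[str]]:
    """Apply split_fixed once per segment size (the 'multi-sized' view).

    Assumption: each size yields its own independent list of segments.
    """
    return {size: split_fixed(sequence, size) for size in sizes}


# Hypothetical 22-residue sequence, used only for illustration.
seq = "MKTAYIAKQRQISFVKSHFSRQ"

single = split_fixed(seq, 8)      # one fixed segment size
multi = split_multi(seq, [4, 8])  # several segment sizes

print(single)
print({size: len(parts) for size, parts in multi.items()})
```

In a pipeline of the kind the description outlines, each segment would then be turned into a vector and the resulting vectors fed to a sequence model. That step is omitted here because the record gives no details about it.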