Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA,...
Published in: | IEEE/ACM transactions on computational biology and bioinformatics 2020-09, Vol.17 (5), p.1648-1659 |
---|---|
Main authors: | Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 1659 |
---|---|
container_issue | 5 |
container_start_page | 1648 |
container_title | IEEE/ACM transactions on computational biology and bioinformatics |
container_volume | 17 |
creator | Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar |
description | The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing its functions, which motivates analyzing these sequences to predict function. Although machine-learning-based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences but also a significant improvement in overall accuracy over the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA-based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. Combining the two models raises the improvement further, to +7.41 and +9.21 percent, respectively. |
doi_str_mv | 10.1109/TCBB.2019.2911609 |
format | Article |
fullrecord | <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_pubmed_primary_30998479</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8692646</ieee_id><sourcerecordid>2211951247</sourcerecordid><originalsourceid>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</originalsourceid><addsrcrecordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2449311013</pqid></control><display><type>article</type><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><source>IEEE Electronic Library (IEL)</source><creator>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creator><creatorcontrib>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creatorcontrib><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2019.2911609</identifier><identifier>PMID: 30998479</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Accuracy ; Amino acid sequence ; Amino acids ; Bi-directional long short-term memory (Bi-LSTM) ; Bidirectional control ; Biological activity ; Biological processes ; Biological system modeling ; Discriminant analysis ; Learning algorithms ; long protein sequence ; Long short-term memory ; Machine learning ; Model accuracy ; multi-label linear discriminant analysis (MLDA) ; Organisms ; protein segment vector ; Protein sequence ; Protein structure ; Proteins ; Segments ; Sequences</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659</ispartof><rights>Copyright The Institute of 
Electrical and Electronics Engineers, Inc. (IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</citedby><cites>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</cites><orcidid>0000-0002-0091-1088 ; 0000-0001-6854-8599</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27903,27904,54737</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30998479$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><subject>Accuracy</subject><subject>Amino acid sequence</subject><subject>Amino acids</subject><subject>Bi-directional long short-term memory (Bi-LSTM)</subject><subject>Bidirectional control</subject><subject>Biological activity</subject><subject>Biological processes</subject><subject>Biological system modeling</subject><subject>Discriminant analysis</subject><subject>Learning algorithms</subject><subject>long protein sequence</subject><subject>Long short-term memory</subject><subject>Machine learning</subject><subject>Model accuracy</subject><subject>multi-label linear discriminant analysis (MLDA)</subject><subject>Organisms</subject><subject>protein segment vector</subject><subject>Protein sequence</subject><subject>Protein 
structure</subject><subject>Proteins</subject><subject>Segments</subject><subject>Sequences</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</recordid><startdate>202009</startdate><enddate>202009</enddate><creator>Ranjan, Ashish</creator><creator>Fahad, Md Shah</creator><creator>Fernandez-Baca, David</creator><creator>Deepak, Akshay</creator><creator>Tripathi, Sudhakar</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></search><sort><creationdate>202009</creationdate><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><author>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Amino acid sequence</topic><topic>Amino acids</topic><topic>Bi-directional long short-term memory (Bi-LSTM)</topic><topic>Bidirectional control</topic><topic>Biological activity</topic><topic>Biological processes</topic><topic>Biological system modeling</topic><topic>Discriminant analysis</topic><topic>Learning algorithms</topic><topic>long protein sequence</topic><topic>Long short-term memory</topic><topic>Machine learning</topic><topic>Model accuracy</topic><topic>multi-label linear discriminant analysis (MLDA)</topic><topic>Organisms</topic><topic>protein segment vector</topic><topic>Protein sequence</topic><topic>Protein 
structure</topic><topic>Proteins</topic><topic>Segments</topic><topic>Sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - 
Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ranjan, Ashish</au><au>Fahad, Md Shah</au><au>Fernandez-Baca, David</au><au>Deepak, Akshay</au><au>Tripathi, Sudhakar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2020-09</date><risdate>2020</risdate><volume>17</volume><issue>5</issue><spage>1648</spage><epage>1659</epage><pages>1648-1659</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. 
Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>30998479</pmid><doi>10.1109/TCBB.2019.2911609</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1545-5963 |
ispartof | IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659 |
issn | 1545-5963 ; 1557-9964 |
language | eng |
recordid | cdi_pubmed_primary_30998479 |
source | IEEE Electronic Library (IEL) |
subjects | Accuracy ; Amino acid sequence ; Amino acids ; Bi-directional long short-term memory (Bi-LSTM) ; Bidirectional control ; Biological activity ; Biological processes ; Biological system modeling ; Discriminant analysis ; Learning algorithms ; long protein sequence ; Long short-term memory ; Machine learning ; Model accuracy ; multi-label linear discriminant analysis (MLDA) ; Organisms ; protein segment vector ; Protein sequence ; Protein structure ; Proteins ; Segments ; Sequences |
title | Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T15%3A03%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Robust%20Framework%20for%20Protein%20Function%20Prediction%20Using%20Variable-Length%20Protein%20Sequences&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Ranjan,%20Ashish&rft.date=2020-09&rft.volume=17&rft.issue=5&rft.spage=1648&rft.epage=1659&rft.pages=1648-1659&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2019.2911609&rft_dat=%3Cproquest_RIE%3E2211951247%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2449311013&rft_id=info:pmid/30998479&rft_ieee_id=8692646&rfr_iscdi=true |
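
The description above mentions analyzing a protein sequence as fixed single-sized segments and as multi-sized segments. The record does not state how the segments are cut, so the following is only a minimal sketch under assumptions: the segments are taken to be consecutive and non-overlapping, the segment sizes are arbitrary illustration values, and the function names `split_into_segments` and `multi_size_segments` are hypothetical, not taken from the paper. The bi-directional LSTM that would consume these segments is not shown.

```python
from typing import Dict, List, Sequence


def split_into_segments(sequence: str, size: int) -> List[str]:
    """Cut `sequence` into consecutive, non-overlapping segments of `size` residues.

    Assumption (not from the record): segments do not overlap, and the final
    segment may be shorter when len(sequence) is not a multiple of `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def multi_size_segments(sequence: str, sizes: Sequence[int]) -> Dict[int, List[str]]:
    """Segment the same sequence once per size; returns {size: segments}.

    The sizes are caller-chosen illustration values, not the paper's settings.
    """
    return {size: split_into_segments(sequence, size) for size in sizes}


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up residues, for illustration only
    print(split_into_segments(toy_sequence, 10))   # single-sized view
    print(multi_size_segments(toy_sequence, (5, 10)))  # multi-sized view
```

In a pipeline of the kind the description outlines, each segment would presumably be encoded and passed to a sequence model; that step depends on details the record does not give and is left out here.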