Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences

The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA,...

Detailed description

Bibliographic details
Published in: IEEE/ACM transactions on computational biology and bioinformatics 2020-09, Vol.17 (5), p.1648-1659
Main authors: Ranjan, Ashish, Fahad, Md Shah, Fernandez-Baca, David, Deepak, Akshay, Tripathi, Sudhakar
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 1659
container_issue 5
container_start_page 1648
container_title IEEE/ACM transactions on computational biology and bioinformatics
container_volume 17
creator Ranjan, Ashish
Fahad, Md Shah
Fernandez-Baca, David
Deepak, Akshay
Tripathi, Sudhakar
description The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences but also significantly improved overall accuracy over the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA-based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.
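The description refers to analysing a protein sequence as fixed single-sized segments and as multi-sized segments. The following is only a minimal sketch of what such a segmentation step might look like. The record does not state the segment sizes or whether segments overlap, and it does not describe how the segments feed the Bi-LSTM. The function names `split_into_segments` and `multi_size_segments`, the non-overlapping split, and the example sizes are therefore all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Sequence


def split_into_segments(sequence: str, size: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping segments.

    Assumption: segments do not overlap and the final segment may be
    shorter than `size`. The catalog record does not specify either point.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def multi_size_segments(sequence: str, sizes: Sequence[int]) -> Dict[int, List[str]]:
    """Segment the same sequence once per size (hypothetical multi-sized view)."""
    return {size: split_into_segments(sequence, size) for size in sizes}


# Toy amino-acid string; the sizes 3, 4 and 5 are illustrative only.
toy_sequence = "MKTAYIAKQR"
single_sized = split_into_segments(toy_sequence, 4)
multi_sized = multi_size_segments(toy_sequence, (3, 5))
```

In this sketch each segment list would still have to be turned into numeric vectors before a sequence model could consume it. That step is left out because the record gives no detail about it.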
doi_str_mv 10.1109/TCBB.2019.2911609
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_pubmed_primary_30998479</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8692646</ieee_id><sourcerecordid>2211951247</sourcerecordid><originalsourceid>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</originalsourceid><addsrcrecordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2449311013</pqid></control><display><type>article</type><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><source>IEEE Electronic Library (IEL)</source><creator>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creator><creatorcontrib>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</creatorcontrib><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2019.2911609</identifier><identifier>PMID: 30998479</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Accuracy ; Amino acid sequence ; Amino acids ; Bi-directional long short-term memory (Bi-LSTM) ; Bidirectional control ; Biological activity ; Biological processes ; Biological system modeling ; Discriminant analysis ; Learning algorithms ; long protein sequence ; Long short-term memory ; Machine learning ; Model accuracy ; multi-label linear discriminant analysis (MLDA) ; Organisms ; protein segment vector ; Protein sequence ; Protein structure ; Proteins ; Segments ; Sequences</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659</ispartof><rights>Copyright The Institute of 
Electrical and Electronics Engineers, Inc. (IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</citedby><cites>FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</cites><orcidid>0000-0002-0091-1088 ; 0000-0001-6854-8599</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27903,27904,54737</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8692646$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30998479$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). 
In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</description><subject>Accuracy</subject><subject>Amino acid sequence</subject><subject>Amino acids</subject><subject>Bi-directional long short-term memory (Bi-LSTM)</subject><subject>Bidirectional control</subject><subject>Biological activity</subject><subject>Biological processes</subject><subject>Biological system modeling</subject><subject>Discriminant analysis</subject><subject>Learning algorithms</subject><subject>long protein sequence</subject><subject>Long short-term memory</subject><subject>Machine learning</subject><subject>Model accuracy</subject><subject>multi-label linear discriminant analysis (MLDA)</subject><subject>Organisms</subject><subject>protein segment vector</subject><subject>Protein sequence</subject><subject>Protein 
structure</subject><subject>Proteins</subject><subject>Segments</subject><subject>Sequences</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkMtOwzAQRS0E4v0BCAlFYsMmxc8ks6SFAlIlEK8dspxkUlLauNiJEH-Po5YuWHnkOXfsOYScMDpgjMLly2g4HHDKYMCBsYTCFtlnSqUxQCK3-1qqWEEi9siB9zNKuQQqd8meoACZTGGfvF8jLqMnm3e-jcbOLPDbus-osi56dLbFuonGXVO0tW3CBZb1qnz1dTON3oyrTT7HeILNtP3YJJ7xq8OmQH9Edioz93i8Pg_J6_jmZXQXTx5u70dXk7gQEtpYpRlPciEVcsmKtEqBlaXEDKpcVErK3HBepaUpTdhOcsyBZiqgtECeZoUSh-RiNXfpbHjat3pR-wLnc9Og7bzmnDFQjMs0oOf_0JntXBN-p7mUIIJYJgLFVlThrPcOK7109cK4H82o7t3r3r3u3eu1-5A5W0_u8gWWm8Sf7ACcroAaETftLAGeyET8Alv0hzY</recordid><startdate>202009</startdate><enddate>202009</enddate><creator>Ranjan, Ashish</creator><creator>Fahad, Md Shah</creator><creator>Fernandez-Baca, David</creator><creator>Deepak, Akshay</creator><creator>Tripathi, Sudhakar</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></search><sort><creationdate>202009</creationdate><title>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</title><author>Ranjan, Ashish ; Fahad, Md Shah ; Fernandez-Baca, David ; Deepak, Akshay ; Tripathi, Sudhakar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c349t-57826b345e241c7f791dd4e89fb3f544ba22f7dada15542eb90852410ce278c53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Amino acid sequence</topic><topic>Amino acids</topic><topic>Bi-directional long short-term memory (Bi-LSTM)</topic><topic>Bidirectional control</topic><topic>Biological activity</topic><topic>Biological processes</topic><topic>Biological system modeling</topic><topic>Discriminant analysis</topic><topic>Learning algorithms</topic><topic>long protein sequence</topic><topic>Long short-term memory</topic><topic>Machine learning</topic><topic>Model accuracy</topic><topic>multi-label linear discriminant analysis (MLDA)</topic><topic>Organisms</topic><topic>protein segment vector</topic><topic>Protein sequence</topic><topic>Protein 
structure</topic><topic>Proteins</topic><topic>Segments</topic><topic>Sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Ranjan, Ashish</creatorcontrib><creatorcontrib>Fahad, Md Shah</creatorcontrib><creatorcontrib>Fernandez-Baca, David</creatorcontrib><creatorcontrib>Deepak, Akshay</creatorcontrib><creatorcontrib>Tripathi, Sudhakar</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering 
Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Ranjan, Ashish</au><au>Fahad, Md Shah</au><au>Fernandez-Baca, David</au><au>Deepak, Akshay</au><au>Tripathi, Sudhakar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2020-09</date><risdate>2020</risdate><volume>17</volume><issue>5</issue><spage>1648</spage><epage>1659</epage><pages>1648-1659</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. 
Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>30998479</pmid><doi>10.1109/TCBB.2019.2911609</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0002-0091-1088</orcidid><orcidid>https://orcid.org/0000-0001-6854-8599</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1545-5963
ispartof IEEE/ACM transactions on computational biology and bioinformatics, 2020-09, Vol.17 (5), p.1648-1659
issn 1545-5963
1557-9964
language eng
recordid cdi_pubmed_primary_30998479
source IEEE Electronic Library (IEL)
subjects Accuracy
Amino acid sequence
Amino acids
Bi-directional long short-term memory (Bi-LSTM)
Bidirectional control
Biological activity
Biological processes
Biological system modeling
Discriminant analysis
Learning algorithms
long protein sequence
Long short-term memory
Machine learning
Model accuracy
multi-label linear discriminant analysis (MLDA)
Organisms
protein segment vector
Protein sequence
Protein structure
Proteins
Segments
Sequences
title Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T15%3A03%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Robust%20Framework%20for%20Protein%20Function%20Prediction%20Using%20Variable-Length%20Protein%20Sequences&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Ranjan,%20Ashish&rft.date=2020-09&rft.volume=17&rft.issue=5&rft.spage=1648&rft.epage=1659&rft.pages=1648-1659&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2019.2911609&rft_dat=%3Cproquest_RIE%3E2211951247%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2449311013&rft_id=info:pmid/30998479&rft_ieee_id=8692646&rfr_iscdi=true