Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model
N-terminal acetylation is one of the most common and important post-translational modifications (PTM) of eukaryotic proteins. PTM plays a crucial role in various cellular processes and disease pathogenesis. Thus, the accurate identification of N-terminal acetylation modifications is important to gain insight into cellular processes and other possible functional mechanisms.
Saved in:
Published in: | Computers in biology and medicine 2024-05, Vol.174, p.108330, Article 108330 |
---|---|
Main authors: | Ke, Jinsong ; Zhao, Jianmei ; Li, Hongfei ; Yuan, Lei ; Dong, Guanghui ; Wang, Guohua |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
container_end_page | |
container_issue | |
container_start_page | 108330 |
container_title | Computers in biology and medicine |
container_volume | 174 |
creator | Ke, Jinsong ; Zhao, Jianmei ; Li, Hongfei ; Yuan, Lei ; Dong, Guanghui ; Wang, Guohua |
description | N-terminal acetylation is one of the most common and important post-translational modifications (PTM) of eukaryotic proteins. PTM plays a crucial role in various cellular processes and disease pathogenesis. Thus, the accurate identification of N-terminal acetylation modifications is important to gain insight into cellular processes and other possible functional mechanisms. Although some algorithmic models have been proposed, most have been developed based on traditional machine learning algorithms and small training datasets, so their practical applications are limited. Deep learning models, by contrast, are better at handling high-throughput and complex data. In this study, DeepCBA, a deep learning model based on a hybrid framework of a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism, was constructed to detect N-terminal acetylation sites. DeepCBA was built as follows: First, a benchmark dataset was generated by selecting low-redundancy protein sequences from the UniProt database and further reducing the redundancy of the protein sequences using the CD-HIT tool. Subsequently, based on the skip-gram model in the word2vec algorithm, tripeptide word vector features were generated on the benchmark dataset. Finally, the CNN, BiLSTM, and attention mechanism were combined, and the tripeptide word vector features were fed into the stacked model for multiple rounds of training. The model performed excellently on the independent dataset test, with accuracy and area under the curve of 80.51% and 87.36%, respectively. Altogether, DeepCBA achieved superior performance compared with the baseline model, and significantly outperformed most existing predictors. Additionally, our model can be used to identify disease loci and drug targets.
•Encoding protein sequences into tripeptide word vectors using the word2vec algorithm.•The attention mechanism improved the recognition of acetylation sites.•The acetylated proteins identified by the model are associated with diseases. |
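The "tripeptide word vector" encoding described above treats each window of three consecutive residues as a word for word2vec training. A minimal sketch of the tokenization step, assuming an overlapping stride-1 window (the record does not state the stride; stride 1 is the common choice for k-mer embeddings):

```python
def tripeptides(seq, step=1):
    """Slide a width-3 window over a protein sequence, yielding
    overlapping tripeptide 'words' suitable for word2vec training."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, step)]

# A hypothetical 5-residue fragment yields three overlapping tripeptides:
print(tripeptides("MKVLA"))  # ['MKV', 'KVL', 'VLA']
```

The resulting token lists would then be fed to a skip-gram trainer (e.g. gensim's `Word2Vec` with `sg=1`) to produce the per-tripeptide embedding vectors.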
doi_str_mv | 10.1016/j.compbiomed.2024.108330 |
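The attention mechanism in the abstract's CNN-BiLSTM-attention stack can be illustrated by its pooling step: each position's BiLSTM hidden state is scored, the scores are softmax-normalized, and the states are averaged with those weights. This is a generic sketch, not the authors' published implementation; the attention vector `w` is a hypothetical stand-in for the model's learned parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(hidden, w):
    """Collapse a (T, d) sequence of BiLSTM hidden states into one
    d-dimensional summary, weighting each position by its attention score.
    `w` is a hypothetical learned attention vector of shape (d,)."""
    scores = softmax(hidden @ w)   # one weight per position; sums to 1
    return scores @ hidden         # attention-weighted average of states
```

The pooled vector would then feed a final classification layer that outputs the acetylation-site probability.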
format | Article |
publisher | United States: Elsevier Ltd |
pmid | 38588617 |
rights | Copyright © 2024. Published by Elsevier Ltd. |
orcidid | https://orcid.org/0009-0009-3399-9301 |
fulltext | fulltext |
identifier | ISSN: 0010-4825 |
ispartof | Computers in biology and medicine, 2024-05, Vol.174, p.108330, Article 108330 |
issn | 0010-4825 1879-0534 1879-0534 |
language | eng |
recordid | cdi_proquest_miscellaneous_3035076015 |
source | MEDLINE; Elsevier ScienceDirect Journals |
subjects | Acetylation Algorithms Amino acids Artificial neural networks Attention Benchmarks BiLSTM CNN Databases, Protein Datasets Decision trees Deep Learning Enzymes Humans KEGG Localization Long short-term memory Machine learning N-terminal acetylation Neural networks Neural Networks, Computer Pathogenesis Post-translation Protein Processing, Post-Translational Proteins Proteins - chemistry Proteins - metabolism Redundancy Support vector machines Therapeutic targets Tripeptide word vectors |
title | Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T16%3A38%3A21IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prediction%20of%20protein%20N-terminal%20acetylation%20modification%20sites%20based%20on%20CNN-BiLSTM-attention%20model&rft.jtitle=Computers%20in%20biology%20and%20medicine&rft.au=Ke,%20Jinsong&rft.date=2024-05&rft.volume=174&rft.spage=108330&rft.pages=108330-&rft.artnum=108330&rft.issn=0010-4825&rft.eissn=1879-0534&rft_id=info:doi/10.1016/j.compbiomed.2024.108330&rft_dat=%3Cproquest_cross%3E3046570730%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3046570730&rft_id=info:pmid/38588617&rft_els_id=S0010482524004141&rfr_iscdi=true |