Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model

N-terminal acetylation is one of the most common and important post-translational modifications (PTMs) of eukaryotic proteins. PTMs play a crucial role in various cellular processes and in disease pathogenesis. Thus, accurate identification of N-terminal acetylation sites is important for gaining insight into cellular processes and other possible functional mechanisms.

Detailed description

Bibliographic details
Published in: Computers in biology and medicine 2024-05, Vol.174, p.108330, Article 108330
Main authors: Ke, Jinsong; Zhao, Jianmei; Li, Hongfei; Yuan, Lei; Dong, Guanghui; Wang, Guohua
Format: Article
Language: eng
Subjects:
Online access: Full text
container_start_page 108330
container_title Computers in biology and medicine
container_volume 174
creator Ke, Jinsong
Zhao, Jianmei
Li, Hongfei
Yuan, Lei
Dong, Guanghui
Wang, Guohua
description N-terminal acetylation is one of the most common and important post-translational modifications (PTMs) of eukaryotic proteins. PTMs play a crucial role in various cellular processes and in disease pathogenesis. Thus, accurate identification of N-terminal acetylation sites is important for gaining insight into cellular processes and other possible functional mechanisms. Although some algorithmic models have been proposed, most were developed with traditional machine learning algorithms on small training datasets, which limits their practical applications. Deep learning models, by contrast, are better suited to high-throughput and complex data. In this study, DeepCBA, a model based on a hybrid deep-learning framework of a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism, was constructed to detect N-terminal acetylation sites. DeepCBA was built as follows. First, a benchmark dataset was generated by selecting low-redundancy protein sequences from the UniProt database and further reducing sequence redundancy with the CD-HIT tool. Subsequently, tripeptide word-vector features were generated on the benchmark dataset using the skip-gram model of the word2vec algorithm. Finally, the CNN, BiLSTM, and attention mechanism were combined, and the tripeptide word-vector features were fed into the stacked model for multiple rounds of training. The model performed excellently on the independent test dataset, with an accuracy of 80.51% and an area under the curve of 87.36%. Altogether, DeepCBA achieved superior performance compared with the baseline model and significantly outperformed most existing predictors. Additionally, our model can be used to identify disease loci and drug targets.
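The encoding step described above (splitting each protein sequence into overlapping tripeptide "words" before word2vec embedding) can be sketched as follows. This is a minimal illustration, not the authors' code; the window step of 1 is an assumption, since the record does not state the exact tokenization parameters.

```python
def tripeptides(seq: str, step: int = 1) -> list[str]:
    """Slide a width-3 window over a protein sequence, yielding the
    overlapping tripeptide tokens that a word2vec skip-gram model
    would treat as the 'words' of one 'sentence'."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, step)]

# Example: a short N-terminal fragment becomes a sentence of tripeptides.
print(tripeptides("MADKLT"))  # ['MAD', 'ADK', 'DKL', 'KLT']
```

The resulting token lists would then be passed to a skip-gram trainer (e.g. gensim's `Word2Vec` with `sg=1`) to learn one vector per tripeptide.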
• Encoding protein sequences into tripeptide word vectors using the word2vec algorithm.
• The attention mechanism improved the recognition of acetylation sites.
• The acetylated proteins identified by the model are associated with diseases.
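The attention mechanism credited above can be illustrated with a minimal, framework-free sketch: one score per sequence position is softmax-normalised and used to weight the per-position feature vectors (plain lists standing in for BiLSTM hidden states). The scoring values here are placeholders, not the paper's trained attention layer.

```python
import math

def attention_pool(states: list[list[float]], scores: list[float]) -> list[float]:
    """Softmax-normalise one score per position, then return the
    attention-weighted sum of the per-position state vectors."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

# Three 2-d "hidden states"; the middle position gets the highest score,
# so the pooled vector is pulled toward the middle state [0.0, 1.0].
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 2.0, 0.1])
```

With uniform scores the pooling reduces to a plain average, which is why a learned scorer lets the model emphasise positions informative for the acetylation site.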
doi_str_mv 10.1016/j.compbiomed.2024.108330
format Article
fulltext fulltext
identifier ISSN: 0010-4825; EISSN: 1879-0534; DOI: 10.1016/j.compbiomed.2024.108330; PMID: 38588617
ispartof Computers in biology and medicine, 2024-05, Vol.174, p.108330, Article 108330
issn 0010-4825
1879-0534
eissn 1879-0534
language eng
recordid cdi_proquest_miscellaneous_3035076015
source MEDLINE; Elsevier ScienceDirect Journals
subjects Acetylation
Algorithms
Amino acids
Artificial neural networks
Attention
Benchmarks
BiLSTM
CNN
Databases, Protein
Datasets
Decision trees
Deep Learning
Enzymes
Humans
KEGG
Localization
Long short-term memory
Machine learning
N-terminal acetylation
Neural networks
Neural Networks, Computer
Pathogenesis
Post-translation
Protein Processing, Post-Translational
Proteins
Proteins - chemistry
Proteins - metabolism
Redundancy
Support vector machines
Therapeutic targets
Tripeptide word vectors
title Prediction of protein N-terminal acetylation modification sites based on CNN-BiLSTM-attention model
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T16%3A38%3A21IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prediction%20of%20protein%20N-terminal%20acetylation%20modification%20sites%20based%20on%20CNN-BiLSTM-attention%20model&rft.jtitle=Computers%20in%20biology%20and%20medicine&rft.au=Ke,%20Jinsong&rft.date=2024-05&rft.volume=174&rft.spage=108330&rft.pages=108330-&rft.artnum=108330&rft.issn=0010-4825&rft.eissn=1879-0534&rft_id=info:doi/10.1016/j.compbiomed.2024.108330&rft_dat=%3Cproquest_cross%3E3046570730%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3046570730&rft_id=info:pmid/38588617&rft_els_id=S0010482524004141&rfr_iscdi=true