4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding
DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned p...
Published in: | Analytical biochemistry 2024-06, Vol.689, p.115492-115492, Article 115492 |
---|---|
Main authors: | Xie, Guo-Bo ; Yu, Yi ; Lin, Zhi-Yi ; Chen, Rui-Bin ; Xie, Jian-Hui ; Liu, Zhen-Guo |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 115492 |
---|---|
container_issue | |
container_start_page | 115492 |
container_title | Analytical biochemistry |
container_volume | 689 |
creator | Xie, Guo-Bo ; Yu, Yi ; Lin, Zhi-Yi ; Chen, Rui-Bin ; Xie, Jian-Hui ; Liu, Zhen-Guo |
description | DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
•Train a new pruning pretraining model on DNA sequences to extract feature information.
•The introduction of manually assisted features has expanded the search space for sequence features.
•Attention fusion strategy better promotes the fusion output of features from different dimensions.
•The complementary nature of machine features and manual features better fits the target features. |
doi_str_mv | 10.1016/j.ab.2024.115492 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_2954778460</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0003269724000368</els_id><sourcerecordid>2954778460</sourcerecordid><originalsourceid>FETCH-LOGICAL-c303t-4685b77e1f14c353016a965a583ad3e4d67aba88ef5e258432124ae09210fc183</originalsourceid><addsrcrecordid>eNp1kLuOFDEQRS0EYmcXciLkkKSH8qsfZMsAC9IKCCC2qtvVg0fd7cF2o-Vv-Ba-DI9mISNyyT73ynUYeyZgK0DULw9b7LcSpN4KYXQnH7CNgK6uQEH3kG0AQFWy7poLdpnSAUAIberH7EK12rQKmg27079_zTuefCYeaQj7xWcfFo7TPkSfv828x0SOl6tjXJcyHSNVOaI_zW8-Xr-mmKvP5ckvez4HRxPHxfFxPaUwZj_6wePER8K8RuK0DMEV9gl7NOKU6On9ecW-vnv7Zfe-uv1082F3fVsNClSudN2avmlIjEIPyqiyNXa1wfJ9dIq0qxvssW1pNCRNq5UUUiNBJwWMg2jVFXtx7j3G8H2llO3s00DThAuFNVnZGd00ra6hoHBGhxhSijTaY_Qzxp9WgD35tgeLvT35tmffJfL8vn3tZ3L_An8FF-DVGaCy4w9P0abBFwfkfNGdrQv-_-1_AI2pkCA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2954778460</pqid></control><display><type>article</type><title>4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding</title><source>Elsevier ScienceDirect Journals Complete</source><creator>Xie, Guo-Bo ; Yu, Yi ; Lin, Zhi-Yi ; Chen, Rui-Bin ; Xie, Jian-Hui ; Liu, Zhen-Guo</creator><creatorcontrib>Xie, Guo-Bo ; Yu, Yi ; Lin, Zhi-Yi ; Chen, Rui-Bin ; Xie, Jian-Hui ; Liu, Zhen-Guo</creatorcontrib><description>DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. 
This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
•Train a new pruning pretraining model on DNA sequences to extract feature information.•The introduction of manually assisted features has expanded the search space for sequence feature.•Attention fusion strategy better promotes the fusion output of features from different dimensions.•The complementary nature of machine features and manual features better fits the target features.</description><identifier>ISSN: 0003-2697</identifier><identifier>EISSN: 1096-0309</identifier><identifier>DOI: 10.1016/j.ab.2024.115492</identifier><identifier>PMID: 38458307</identifier><language>eng</language><publisher>United States: Elsevier Inc</publisher><subject>4 mC ; DNABert-4mC ; Feature fusion ; Pre-training ; Pruning</subject><ispartof>Analytical biochemistry, 2024-06, Vol.689, p.115492-115492, Article 115492</ispartof><rights>2024 Elsevier Inc.</rights><rights>Copyright © 2024. Published by Elsevier Inc.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c303t-4685b77e1f14c353016a965a583ad3e4d67aba88ef5e258432124ae09210fc183</cites><orcidid>0000-0002-3464-3472</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ab.2024.115492$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38458307$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Xie, Guo-Bo</creatorcontrib><creatorcontrib>Yu, Yi</creatorcontrib><creatorcontrib>Lin, Zhi-Yi</creatorcontrib><creatorcontrib>Chen, Rui-Bin</creatorcontrib><creatorcontrib>Xie, Jian-Hui</creatorcontrib><creatorcontrib>Liu, Zhen-Guo</creatorcontrib><title>4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature 
encoding</title><title>Analytical biochemistry</title><addtitle>Anal Biochem</addtitle><description>DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
•Train a new pruning pretraining model on DNA sequences to extract feature information.•The introduction of manually assisted features has expanded the search space for sequence feature.•Attention fusion strategy better promotes the fusion output of features from different dimensions.•The complementary nature of machine features and manual features better fits the target features.</description><subject>4 mC</subject><subject>DNABert-4mC</subject><subject>Feature fusion</subject><subject>Pre-training</subject><subject>Pruning</subject><issn>0003-2697</issn><issn>1096-0309</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNp1kLuOFDEQRS0EYmcXciLkkKSH8qsfZMsAC9IKCCC2qtvVg0fd7cF2o-Vv-Ba-DI9mISNyyT73ynUYeyZgK0DULw9b7LcSpN4KYXQnH7CNgK6uQEH3kG0AQFWy7poLdpnSAUAIberH7EK12rQKmg27079_zTuefCYeaQj7xWcfFo7TPkSfv828x0SOl6tjXJcyHSNVOaI_zW8-Xr-mmKvP5ckvez4HRxPHxfFxPaUwZj_6wePER8K8RuK0DMEV9gl7NOKU6On9ecW-vnv7Zfe-uv1082F3fVsNClSudN2avmlIjEIPyqiyNXa1wfJ9dIq0qxvssW1pNCRNq5UUUiNBJwWMg2jVFXtx7j3G8H2llO3s00DThAuFNVnZGd00ra6hoHBGhxhSijTaY_Qzxp9WgD35tgeLvT35tmffJfL8vn3tZ3L_An8FF-DVGaCy4w9P0abBFwfkfNGdrQv-_-1_AI2pkCA</recordid><startdate>20240601</startdate><enddate>20240601</enddate><creator>Xie, Guo-Bo</creator><creator>Yu, Yi</creator><creator>Lin, Zhi-Yi</creator><creator>Chen, Rui-Bin</creator><creator>Xie, Jian-Hui</creator><creator>Liu, Zhen-Guo</creator><general>Elsevier Inc</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0002-3464-3472</orcidid></search><sort><creationdate>20240601</creationdate><title>4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding</title><author>Xie, Guo-Bo ; Yu, Yi ; Lin, Zhi-Yi ; Chen, Rui-Bin ; Xie, Jian-Hui ; Liu, 
Zhen-Guo</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c303t-4685b77e1f14c353016a965a583ad3e4d67aba88ef5e258432124ae09210fc183</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>4 mC</topic><topic>DNABert-4mC</topic><topic>Feature fusion</topic><topic>Pre-training</topic><topic>Pruning</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Xie, Guo-Bo</creatorcontrib><creatorcontrib>Yu, Yi</creatorcontrib><creatorcontrib>Lin, Zhi-Yi</creatorcontrib><creatorcontrib>Chen, Rui-Bin</creatorcontrib><creatorcontrib>Xie, Jian-Hui</creatorcontrib><creatorcontrib>Liu, Zhen-Guo</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><jtitle>Analytical biochemistry</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Xie, Guo-Bo</au><au>Yu, Yi</au><au>Lin, Zhi-Yi</au><au>Chen, Rui-Bin</au><au>Xie, Jian-Hui</au><au>Liu, Zhen-Guo</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding</atitle><jtitle>Analytical biochemistry</jtitle><addtitle>Anal Biochem</addtitle><date>2024-06-01</date><risdate>2024</risdate><volume>689</volume><spage>115492</spage><epage>115492</epage><pages>115492-115492</pages><artnum>115492</artnum><issn>0003-2697</issn><eissn>1096-0309</eissn><abstract>DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. 
In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
•Train a new pruning pretraining model on DNA sequences to extract feature information.•The introduction of manually assisted features has expanded the search space for sequence feature.•Attention fusion strategy better promotes the fusion output of features from different dimensions.•The complementary nature of machine features and manual features better fits the target features.</abstract><cop>United States</cop><pub>Elsevier Inc</pub><pmid>38458307</pmid><doi>10.1016/j.ab.2024.115492</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0002-3464-3472</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0003-2697 |
ispartof | Analytical biochemistry, 2024-06, Vol.689, p.115492-115492, Article 115492 |
issn | 0003-2697 ; 1096-0309 |
language | eng |
recordid | cdi_proquest_miscellaneous_2954778460 |
source | Elsevier ScienceDirect Journals Complete |
subjects | 4 mC ; DNABert-4mC ; Feature fusion ; Pre-training ; Pruning |
title | 4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T11%3A57%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=4%C2%A0mC%20site%20recognition%20algorithm%20based%20on%20pruned%20pre-trained%20DNABert-Pruning%20model%20and%20fused%20artificial%20feature%20encoding&rft.jtitle=Analytical%20biochemistry&rft.au=Xie,%20Guo-Bo&rft.date=2024-06-01&rft.volume=689&rft.spage=115492&rft.epage=115492&rft.pages=115492-115492&rft.artnum=115492&rft.issn=0003-2697&rft.eissn=1096-0309&rft_id=info:doi/10.1016/j.ab.2024.115492&rft_dat=%3Cproquest_cross%3E2954778460%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2954778460&rft_id=info:pmid/38458307&rft_els_id=S0003269724000368&rfr_iscdi=true |
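
The description above mentions an "artificial feature encoding" module that supplements the pre-trained features, but the record does not say which hand-crafted encodings the authors use. As a rough illustration only, the sketch below assumes two common choices for DNA sequences, one-hot encoding and normalized k-mer frequencies, and simply concatenates them. The function names (`one_hot`, `kmer_frequencies`, `handcrafted_features`) are hypothetical and are not taken from the paper.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector.

    Bases outside A/C/G/T (for example N) become all zeros.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Return normalized counts of all 4**k k-mers in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing unknown bases
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def handcrafted_features(seq, k=2):
    """Concatenate the flattened one-hot matrix with the k-mer frequency vector.

    This is an assumed, simplified stand-in for a hand-crafted feature module;
    the paper's actual encodings and its AFF-4mC fusion are not reproduced here.
    """
    flat = [value for row in one_hot(seq) for value in row]
    return flat + kmer_frequencies(seq, k)


if __name__ == "__main__":
    features = handcrafted_features("ACGTAC")
    print(len(features))
```

In a pipeline like the one the abstract outlines, a vector of this kind would be combined with the embedding produced by the pruned language model; plain concatenation is used here only to keep the sketch short, whereas the abstract describes an attention-based fusion strategy.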