BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of...

Bibliographic details
Main authors:	Pei, Qizhi; Wu, Lijun; Gao, Kaiyuan; Liang, Xiaozhuan; Fang, Yin; Zhu, Jinhua; Xie, Shufang; Qin, Tao; Yan, Rui
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Computational Engineering, Finance, and Science; Computer Science - Learning; Quantitative Biology - Biomolecules; Quantitative Biology - Quantitative Methods

creator	Pei, Qizhi; Wu, Lijun; Gao, Kaiyuan; Liang, Xiaozhuan; Fang, Yin; Zhu, Jinhua; Xie, Shufang; Qin, Tao; Yan, Rui
description	Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and substantially improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
doi_str_mv	10.48550/arxiv.2402.17810
format	Article
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2402.17810
language	eng
recordid	cdi_arxiv_primary_2402_17810
source	arXiv.org
subjects	Computer Science - Artificial Intelligence; Computer Science - Computational Engineering, Finance, and Science; Computer Science - Learning; Quantitative Biology - Biomolecules; Quantitative Biology - Quantitative Methods
title	BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T18%3A05%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=BioT5+:%20Towards%20Generalized%20Biological%20Understanding%20with%20IUPAC%20Integration%20and%20Multi-task%20Tuning&rft.au=Pei,%20Qizhi&rft.date=2024-02-27&rft_id=info:doi/10.48550/arxiv.2402.17810&rft_dat=%3Carxiv_GOX%3E2402_17810%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true