BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.
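The "100% robust" SELFIES claim refers to the property that any sequence of SELFIES symbols decodes to a syntactically valid molecule, so a generative model that emits SELFIES tokens cannot produce invalid structures the way raw SMILES generation can. A minimal sketch of this property, assuming the open-source `selfies` Python package (not part of this record; the molecules are illustrative, not drawn from the paper):

```python
# Sketch of SELFIES robustness using the open-source `selfies` package
# (pip install selfies). Molecules are illustrative examples only.
import selfies as sf

# Round-trip: SMILES -> SELFIES -> SMILES.
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
encoded = sf.encoder(aspirin)       # e.g. "[C][C][=Branch1][C][=O][O]..."
print(encoded)
print(sf.decoder(encoded))          # decodes back to a valid SMILES string

# Robustness: an arbitrary sequence of SELFIES symbols still decodes to
# SOME valid molecule, so sequence models emitting SELFIES tokens cannot
# generate syntactically invalid structures, unlike raw SMILES.
arbitrary = "[C][O][=C][Ring1][Branch1][N]"
print(sf.decoder(arbitrary))        # yields a valid (if unintended) SMILES
```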

Bibliographic Details

Main Authors: Pei, Qizhi; Zhang, Wei; Zhu, Jinhua; Wu, Kehan; Gao, Kaiyuan; Wu, Lijun; Xia, Yingce; Yan, Rui
Format: Article
Language: English
Published: 2023-10-11 (arXiv)
DOI: 10.48550/arxiv.2310.07276
Source: arXiv.org
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning; Quantitative Biology - Biomolecules
Online Access: Order full text