BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.
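The "100% robust" SELFIES claim refers to the property that any sequence of SELFIES symbols decodes to a syntactically valid molecule, so a generative model that emits SELFIES tokens cannot produce invalid structures the way raw SMILES generation can. A minimal sketch of this property, assuming the open-source `selfies` Python package (not part of this record; the molecules are illustrative, not drawn from the paper):

```python
# Sketch of SELFIES robustness using the open-source `selfies` package
# (pip install selfies). Molecules are illustrative examples only.
import selfies as sf

# Round-trip: SMILES -> SELFIES -> SMILES.
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
encoded = sf.encoder(aspirin)       # e.g. "[C][C][=Branch1][C][=O][O]..."
print(encoded)
print(sf.decoder(encoded))          # decodes back to a valid SMILES string

# Robustness: an arbitrary sequence of SELFIES symbols still decodes to
# SOME valid molecule, so sequence models emitting SELFIES tokens cannot
# generate syntactically invalid structures, unlike raw SMILES.
arbitrary = "[C][O][=C][Ring1][Branch1][N]"
print(sf.decoder(arbitrary))        # yields a valid (if unintended) SMILES
```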

Bibliographic Details

Main Authors: Pei, Qizhi; Zhang, Wei; Zhu, Jinhua; Wu, Kehan; Gao, Kaiyuan; Wu, Lijun; Xia, Yingce; Yan, Rui
Format: Article
Language: English
Published: 2023-10-11 (arXiv)
DOI: 10.48550/arxiv.2310.07276
Source: arXiv.org
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning; Quantitative Biology - Biomolecules
Online Access: Order full text