Multi-modal Transfer Learning between Biological Foundation Models

Biological sequences encode fundamental instructions for the building blocks of life, in the form of DNA, RNA, and proteins. Modeling these sequences is key to understanding disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks, but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e., the same DNA sequence) and map to different transcript expression levels across various human tissues. We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods and leveraging the use of multiple modalities. Our framework also achieves efficient knowledge transfer from the encoders' pre-training as well as between modalities. We open-source our model, paving the way for new multi-modal gene expression approaches.
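
The abstract describes the architecture only at a high level. The following is a minimal, hypothetical sketch of the kind of model it outlines: frozen pre-trained encoders for DNA, RNA, and protein whose embeddings are projected into a shared space, fused, and mapped to per-tissue transcript expression. All module names, dimensions, and the mean-pooling fusion are illustrative assumptions, not the authors' released implementation (the record does not specify the actual fusion mechanism).

```python
# Hypothetical sketch of a multi-modal aggregation model in the spirit of
# IsoFormer: frozen, pre-trained modality-specific encoders embed DNA, RNA,
# and protein sequences; the embeddings are fused and mapped to per-tissue
# transcript expression levels. Names and dimensions are illustrative.
import torch
import torch.nn as nn


class MultiModalExpressionModel(nn.Module):
    def __init__(self, dna_encoder, rna_encoder, protein_encoder,
                 embed_dim=512, num_tissues=30):
        super().__init__()
        # Pre-trained encoders are kept frozen; only the fusion layers
        # and the prediction head are trained on the downstream task.
        self.encoders = nn.ModuleDict({
            "dna": dna_encoder,
            "rna": rna_encoder,
            "protein": protein_encoder,
        })
        for enc in self.encoders.values():
            for p in enc.parameters():
                p.requires_grad = False
        # Project each modality's embedding into a shared space.
        self.projections = nn.ModuleDict({
            name: nn.LazyLinear(embed_dim) for name in self.encoders
        })
        # Simple concatenation fusion + regression head over tissues.
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_tissues),
        )

    def forward(self, dna_tokens, rna_tokens, protein_tokens):
        # Each encoder is assumed to return hidden states of shape
        # (batch, seq_len, hidden_dim) for its own token vocabulary.
        inputs = {"dna": dna_tokens, "rna": rna_tokens, "protein": protein_tokens}
        fused = []
        for name, tokens in inputs.items():
            with torch.no_grad():
                hidden = self.encoders[name](tokens)
            # Mean-pool over sequence positions, then project.
            fused.append(self.projections[name](hidden.mean(dim=1)))
        # (batch, num_tissues): predicted expression per human tissue.
        return self.head(torch.cat(fused, dim=-1))
```

In this framing, freezing the encoders is what keeps transfer from their pre-training cheap: only the projection layers and the prediction head are optimized on the expression task, while each modality contributes through its own specialized encoder.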

Bibliographic Details

Main Authors: Garau-Luis, Juan Jose; Bordes, Patrick; Gonzalez, Liam; Roller, Masa; de Almeida, Bernardo P.; Hexemer, Lorenz; Blum, Christopher; Laurent, Stefan; Grzegorzewski, Jan; Lang, Maren; Pierrot, Thomas; Richard, Guillaume
Format: Article
Language: English
Published: 2024-06-20
DOI: 10.48550/arxiv.2406.14150
Source: arXiv.org
Subjects: Computer Science - Learning
URL: https://arxiv.org/abs/2406.14150