Dirichlet Diffusion Score Model for Biological Sequence Generation

Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. Score-based generative stochastic differe...

Detailed description

Bibliographic details
Main authors: Avdeyev, Pavel; Shi, Chenlai; Tan, Yuhao; Dudnyk, Kseniia; Zhou, Jian
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Avdeyev, Pavel
Shi, Chenlai
Tan, Yuhao
Dudnyk, Kseniia
Zhou, Jian
description Designing biological sequences is an important challenge that requires satisfying complex constraints, which makes it a natural problem for deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, we introduce a diffusion process defined on the probability simplex whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. Using a Sudoku generation task, we demonstrate that this technique can generate samples that satisfy hard constraints. The model can also solve Sudoku puzzles, including hard ones, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that the designed sequences share similar properties with natural promoter sequences.
doi_str_mv 10.48550/arxiv.2305.10699
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2305_10699</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2305_10699</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-d5175a3f544b5ab5bc3c292dc334558336abf5332048f6b4ca85144f894b1723</originalsourceid><addsrcrecordid>eNotz7FOwzAUhWEvDKjwAEz4BRJsX9_EHmkLBamIIezRtWODJROD2yJ4e6AwneXXkT7GLqRotUEUV1Q_00erQGArRWftKVuuU03-JYc9X6cYD7tUZj74UgN_KFPIPJbKl6nk8pw8ZT6E90OYfeCbMIdK-5_8jJ1Eyrtw_r8LNtzePK3umu3j5n51vW2o620zoeyRIKLWDsmh8-CVVZMH0IgGoCMXEUAJbWLntCeDUutorHayV7Bgl3-vR8P4VtMr1a_x1zIeLfANNjBDZg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Dirichlet Diffusion Score Model for Biological Sequence Generation</title><source>arXiv.org</source><creator>Avdeyev, Pavel ; Shi, Chenlai ; Tan, Yuhao ; Dudnyk, Kseniia ; Zhou, Jian</creator><creatorcontrib>Avdeyev, Pavel ; Shi, Chenlai ; Tan, Yuhao ; Dudnyk, Kseniia ; Zhou, Jian</creatorcontrib><description>Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. Score-based generative stochastic differential equations (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as Dirichlet diffusion score model. 
We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.</description><identifier>DOI: 10.48550/arxiv.2305.10699</identifier><language>eng</language><subject>Computer Science - Learning ; Quantitative Biology - Genomics ; Quantitative Biology - Quantitative Methods</subject><creationdate>2023-05</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2305.10699$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2305.10699$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Avdeyev, Pavel</creatorcontrib><creatorcontrib>Shi, Chenlai</creatorcontrib><creatorcontrib>Tan, Yuhao</creatorcontrib><creatorcontrib>Dudnyk, Kseniia</creatorcontrib><creatorcontrib>Zhou, Jian</creatorcontrib><title>Dirichlet Diffusion Score Model for Biological Sequence Generation</title><description>Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. 
Score-based generative stochastic differential equations (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.</description><subject>Computer Science - Learning</subject><subject>Quantitative Biology - Genomics</subject><subject>Quantitative Biology - Quantitative Methods</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz7FOwzAUhWEvDKjwAEz4BRJsX9_EHmkLBamIIezRtWODJROD2yJ4e6AwneXXkT7GLqRotUEUV1Q_00erQGArRWftKVuuU03-JYc9X6cYD7tUZj74UgN_KFPIPJbKl6nk8pw8ZT6E90OYfeCbMIdK-5_8jJ1Eyrtw_r8LNtzePK3umu3j5n51vW2o620zoeyRIKLWDsmh8-CVVZMH0IgGoCMXEUAJbWLntCeDUutorHayV7Bgl3-vR8P4VtMr1a_x1zIeLfANNjBDZg</recordid><startdate>20230518</startdate><enddate>20230518</enddate><creator>Avdeyev, Pavel</creator><creator>Shi, Chenlai</creator><creator>Tan, Yuhao</creator><creator>Dudnyk, Kseniia</creator><creator>Zhou, Jian</creator><scope>AKY</scope><scope>ALC</scope><scope>GOX</scope></search><sort><creationdate>20230518</creationdate><title>Dirichlet Diffusion Score Model for Biological Sequence 
Generation</title><author>Avdeyev, Pavel ; Shi, Chenlai ; Tan, Yuhao ; Dudnyk, Kseniia ; Zhou, Jian</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-d5175a3f544b5ab5bc3c292dc334558336abf5332048f6b4ca85144f894b1723</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Learning</topic><topic>Quantitative Biology - Genomics</topic><topic>Quantitative Biology - Quantitative Methods</topic><toplevel>online_resources</toplevel><creatorcontrib>Avdeyev, Pavel</creatorcontrib><creatorcontrib>Shi, Chenlai</creatorcontrib><creatorcontrib>Tan, Yuhao</creatorcontrib><creatorcontrib>Dudnyk, Kseniia</creatorcontrib><creatorcontrib>Zhou, Jian</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Avdeyev, Pavel</au><au>Shi, Chenlai</au><au>Tan, Yuhao</au><au>Dudnyk, Kseniia</au><au>Zhou, Jian</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Dirichlet Diffusion Score Model for Biological Sequence Generation</atitle><date>2023-05-18</date><risdate>2023</risdate><abstract>Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. Score-based generative stochastic differential equations (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. 
To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.</abstract><doi>10.48550/arxiv.2305.10699</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2305.10699
ispartof
issn
language eng
recordid cdi_arxiv_primary_2305_10699
source arXiv.org
subjects Computer Science - Learning
Quantitative Biology - Genomics
Quantitative Biology - Quantitative Methods
title Dirichlet Diffusion Score Model for Biological Sequence Generation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T13%3A34%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Dirichlet%20Diffusion%20Score%20Model%20for%20Biological%20Sequence%20Generation&rft.au=Avdeyev,%20Pavel&rft.date=2023-05-18&rft_id=info:doi/10.48550/arxiv.2305.10699&rft_dat=%3Carxiv_GOX%3E2305_10699%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true