Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Detailed Description

Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation with $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at https://sites.google.com/view/lasro.
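
The abstract combines two ideas: a reward model learned directly in latent space so that an arbitrary, possibly non-differentiable reward becomes differentiable, and reward-gradient fine-tuning of a ≤2-step generator through that surrogate. The sketch below is a minimal toy illustration of that training loop, not the paper's implementation; TinyGenerator, SurrogateReward, black_box_reward, the latent shape, and all hyperparameters are hypothetical stand-ins (LaSRO's use of SDXL's latent space and pre-trained latent DMs is omitted here).

```python
# Minimal sketch (assumptions only, not the authors' code): learn a
# differentiable latent-space surrogate for a black-box reward, then
# fine-tune a fast generator by backpropagating through the surrogate.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_SHAPE = (4, 8, 8)  # toy latent size chosen for illustration only

class TinyGenerator(nn.Module):
    """Stand-in for a timestep-distilled (<= 2-step) latent generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 4, 3, padding=1),
        )

    def forward(self, noise):
        # one "denoising" pass: noise -> generated latent
        return self.net(noise)

class SurrogateReward(nn.Module):
    """Differentiable reward model that scores latents directly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Flatten(), nn.Linear(32 * 4 * 4, 1),
        )

    def forward(self, latents):
        return self.net(latents).squeeze(-1)

def black_box_reward(latents):
    """Non-differentiable reward (stands in for an external scorer that
    would normally decode latents to images before scoring)."""
    with torch.no_grad():
        return latents.flatten(1).std(dim=1)

generator = TinyGenerator()
surrogate = SurrogateReward()
opt_r = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

for step in range(200):
    noise = torch.randn(16, *LATENT_SHAPE)

    # (1) Fit the surrogate to the black-box reward on detached latents,
    #     so the reward-model update does not touch the generator.
    latents = generator(noise).detach()
    loss_r = F.mse_loss(surrogate(latents), black_box_reward(latents))
    opt_r.zero_grad()
    loss_r.backward()
    opt_r.step()

    # (2) Fine-tune the generator by ascending the surrogate reward;
    #     the surrogate's gradient flows through the latents.
    opt_g.zero_grad()
    reward = surrogate(generator(noise)).mean()
    (-reward).backward()
    opt_g.step()
```

The detail the sketch preserves is the alternation: the surrogate is regressed onto the black-box score on detached samples, while the generator update backpropagates through the surrogate and updates only the generator's weights.
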
Bibliographic Details
Main authors: Jia, Zhiwei; Nan, Yuesong; Zhao, Huixi; Liu, Gengdai
Format: Article
Language: English
Subjects: Computer Science - Learning
Online access: Order full text
DOI: 10.48550/arxiv.2411.15247
Published: 2024-11-22
Source: arXiv.org