Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Detailed Description

Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation with $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at https://sites.google.com/view/lasro.
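
The abstract combines two ideas: a reward model learned directly in latent space so that an arbitrary, possibly non-differentiable reward becomes differentiable, and reward-gradient fine-tuning of a ≤2-step generator through that surrogate. The sketch below is a minimal toy illustration of that training loop, not the paper's implementation; TinyGenerator, SurrogateReward, black_box_reward, the latent shape, and all hyperparameters are hypothetical stand-ins (LaSRO's use of SDXL's latent space and pre-trained latent DMs is omitted here).

```python
# Minimal sketch (assumptions only, not the authors' code): learn a
# differentiable latent-space surrogate for a black-box reward, then
# fine-tune a fast generator by backpropagating through the surrogate.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_SHAPE = (4, 8, 8)  # toy latent size chosen for illustration only

class TinyGenerator(nn.Module):
    """Stand-in for a timestep-distilled (<= 2-step) latent generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 4, 3, padding=1),
        )

    def forward(self, noise):
        # one "denoising" pass: noise -> generated latent
        return self.net(noise)

class SurrogateReward(nn.Module):
    """Differentiable reward model that scores latents directly."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Flatten(), nn.Linear(32 * 4 * 4, 1),
        )

    def forward(self, latents):
        return self.net(latents).squeeze(-1)

def black_box_reward(latents):
    """Non-differentiable reward (stands in for an external scorer that
    would normally decode latents to images before scoring)."""
    with torch.no_grad():
        return latents.flatten(1).std(dim=1)

generator = TinyGenerator()
surrogate = SurrogateReward()
opt_r = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

for step in range(200):
    noise = torch.randn(16, *LATENT_SHAPE)

    # (1) Fit the surrogate to the black-box reward on detached latents,
    #     so the reward-model update does not touch the generator.
    latents = generator(noise).detach()
    loss_r = F.mse_loss(surrogate(latents), black_box_reward(latents))
    opt_r.zero_grad()
    loss_r.backward()
    opt_r.step()

    # (2) Fine-tune the generator by ascending the surrogate reward;
    #     the surrogate's gradient flows through the latents.
    opt_g.zero_grad()
    reward = surrogate(generator(noise)).mean()
    (-reward).backward()
    opt_g.step()
```

The detail the sketch preserves is the alternation: the surrogate is regressed onto the black-box score on detached samples, while the generator update backpropagates through the surrogate and updates only the generator's weights.
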
Bibliographic Details
Main authors: Jia, Zhiwei; Nan, Yuesong; Zhao, Huixi; Liu, Gengdai
Format: Article
Language: English
Subjects: Computer Science - Learning
Online access: Order full text
DOI: 10.48550/arxiv.2411.15247
Published: 2024-11-22
Source: arXiv.org