Video Prediction Models as Rewards for Reinforcement Learning

Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: https://escontrela.me/viper

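The abstract describes the core mechanism: a transformer is pretrained on expert videos, and the conditional likelihood it assigns to the agent's observed frames serves as the reward. Below is a minimal sketch of that likelihood-as-reward idea, assuming a hypothetical video_model object with a log_prob(context_frames, next_frame) method and a Gym-style environment whose step returns (obs, reward, done, info); it is an illustration under those assumptions, not the paper's actual implementation.

    import numpy as np

    class ViperRewardWrapper:
        """Replace the environment's task reward with the frozen video
        model's conditional log-likelihood of each newly observed frame.

        Sketch only: `video_model.log_prob` is a hypothetical interface
        standing in for log p(x_t | x_<t) under a pretrained
        autoregressive video model.
        """

        def __init__(self, env, video_model, context_len=16):
            self.env = env
            self.model = video_model
            self.context_len = context_len
            self.frames = []

        def reset(self):
            obs = self.env.reset()
            self.frames = [obs]
            return obs

        def step(self, action):
            obs, _task_reward, done, info = self.env.step(action)
            # Condition on the most recent frames and score how likely
            # the expert-trained model finds the transition the agent
            # just produced; that score is the RL reward.
            context = np.stack(self.frames[-self.context_len:])
            reward = float(self.model.log_prob(context, obs))
            self.frames.append(obs)
            return obs, reward, done, info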

Bibliographic Details
Main Authors: Escontrela, Alejandro; Adeniji, Ademi; Yan, Wilson; Jain, Ajay; Peng, Xue Bin; Goldberg, Ken; Lee, Youngwoon; Hafner, Danijar; Abbeel, Pieter
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: Order full text
DOI: 10.48550/arXiv.2305.14343
Source: arXiv.org