Video Prediction Models as Rewards for Reinforcement Learning

Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: https://escontrela.me/viper

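The abstract describes the core mechanism: a transformer is pretrained on expert videos, and the conditional likelihood it assigns to the agent's observed frames serves as the reward. Below is a minimal sketch of that likelihood-as-reward idea, assuming a hypothetical video_model object with a log_prob(context_frames, next_frame) method and a Gym-style environment whose step returns (obs, reward, done, info); it is an illustration under those assumptions, not the paper's actual implementation.

    import numpy as np

    class ViperRewardWrapper:
        """Replace the environment's task reward with the frozen video
        model's conditional log-likelihood of each newly observed frame.

        Sketch only: `video_model.log_prob` is a hypothetical interface
        standing in for log p(x_t | x_<t) under a pretrained
        autoregressive video model.
        """

        def __init__(self, env, video_model, context_len=16):
            self.env = env
            self.model = video_model
            self.context_len = context_len
            self.frames = []

        def reset(self):
            obs = self.env.reset()
            self.frames = [obs]
            return obs

        def step(self, action):
            obs, _task_reward, done, info = self.env.step(action)
            # Condition on the most recent frames and score how likely
            # the expert-trained model finds the transition the agent
            # just produced; that score is the RL reward.
            context = np.stack(self.frames[-self.context_len:])
            reward = float(self.model.log_prob(context, obs))
            self.frames.append(obs)
            return obs, reward, done, info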

Bibliographic Details
Main Authors: Escontrela, Alejandro; Adeniji, Ademi; Yan, Wilson; Jain, Ajay; Peng, Xue Bin; Goldberg, Ken; Lee, Youngwoon; Hafner, Danijar; Abbeel, Pieter
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: Order full text
DOI: 10.48550/arXiv.2305.14343
Source: arXiv.org