Video Prediction Models as Rewards for Reinforcement Learning
Main authors: | Escontrela, Alejandro; Adeniji, Ademi; Yan, Wilson; Jain, Ajay; Peng, Xue Bin; Goldberg, Ken; Lee, Youngwoon; Hafner, Danijar; Abbeel, Pieter |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning |
Online access: | Order full text |
creator | Escontrela, Alejandro; Adeniji, Ademi; Yan, Wilson; Jain, Ajay; Peng, Xue Bin; Goldberg, Ken; Lee, Youngwoon; Hafner, Danijar; Abbeel, Pieter |
description | Specifying reward signals that allow agents to learn complex behaviors is a
long-standing challenge in reinforcement learning. A promising approach is to
extract preferences for behaviors from unlabeled videos, which are widely
available on the internet. We present Video Prediction Rewards (VIPER), an
algorithm that leverages pretrained video prediction models as action-free
reward signals for reinforcement learning. Specifically, we first train an
autoregressive transformer on expert videos and then use the video prediction
likelihoods as reward signals for a reinforcement learning agent. VIPER enables
expert-level control without programmatic task rewards across a wide range of
DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction
model allows us to derive rewards for an out-of-distribution environment where
no expert data is available, enabling cross-embodiment generalization for
tabletop manipulation. We see our work as a starting point for scalable reward
specification from unlabeled videos that will benefit from the rapid advances
in generative modeling. Source code and datasets are available on the project
website: https://escontrela.me/viper |
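
The reward construction described in the abstract can be made concrete with a short sketch. This is illustrative only, not the authors' released implementation: `VideoModel` and its `log_prob` method are hypothetical stand-ins for whatever interface a pretrained autoregressive video model would expose, and the context length is an assumed parameter.

```python
# Sketch of a VIPER-style reward: score the agent's observed frame by the
# log-likelihood a video model trained on expert videos assigns to it,
# conditioned on the preceding frames. Hypothetical interface, for
# illustration only.
from typing import Protocol, Sequence

import numpy as np


class VideoModel(Protocol):
    def log_prob(self, context: Sequence[np.ndarray], frame: np.ndarray) -> float:
        """Return log p(frame | context) under the expert-trained model."""
        ...


def viper_reward(
    model: VideoModel,
    frames: Sequence[np.ndarray],
    t: int,
    context_len: int = 16,  # assumed context window, not from the paper
) -> float:
    """Reward at timestep t: trajectories that resemble the expert videos
    receive high likelihood under the model, and therefore high reward."""
    context = frames[max(0, t - context_len):t]
    return float(model.log_prob(context, frames[t]))
```

An RL agent then maximizes the sum of these likelihood rewards in place of a hand-programmed task reward, which is how the abstract's "action-free reward signal" enters the training loop.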
doi_str_mv | 10.48550/arxiv.2305.14343 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2305_14343</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2305_14343</sourcerecordid><originalsourceid>FETCH-LOGICAL-a673-a4932cbb0111355626a6fbfafc034fff3eb4cf45c6890e16eb46c7e21c7891f03</originalsourceid><addsrcrecordid>eNotj81OwzAQhH3hUBUeoCf8Agl21naSAwdUFaiUCoQqrtHG2UWWWgc5VYG3J_05fTOX0XxCLLTKTWWtesD0G455Acrm2oCBmXj8DD0N8j1RH_whDFFuhp52o8RRftAPpn6UPKQphzjR057iQTaEKYb4dStuGHcj3V05F9vn1Xb5mjVvL-vlU5OhKyFDU0Phu05prcFaVzh03DGyV2CYGagzno31rqoVaTdV50sqtC-rWrOCubi_zJ7_t98p7DH9tSeP9uwB_xhZQs4</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Video Prediction Models as Rewards for Reinforcement Learning</title><source>arXiv.org</source><creator>Escontrela, Alejandro ; Adeniji, Ademi ; Yan, Wilson ; Jain, Ajay ; Peng, Xue Bin ; Goldberg, Ken ; Lee, Youngwoon ; Hafner, Danijar ; Abbeel, Pieter</creator><creatorcontrib>Escontrela, Alejandro ; Adeniji, Ademi ; Yan, Wilson ; Jain, Ajay ; Peng, Xue Bin ; Goldberg, Ken ; Lee, Youngwoon ; Hafner, Danijar ; Abbeel, Pieter</creatorcontrib><description>Specifying reward signals that allow agents to learn complex behaviors is a
long-standing challenge in reinforcement learning. A promising approach is to
extract preferences for behaviors from unlabeled videos, which are widely
available on the internet. We present Video Prediction Rewards (VIPER), an
algorithm that leverages pretrained video prediction models as action-free
reward signals for reinforcement learning. Specifically, we first train an
autoregressive transformer on expert videos and then use the video prediction
likelihoods as reward signals for a reinforcement learning agent. VIPER enables
expert-level control without programmatic task rewards across a wide range of
DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction
model allows us to derive rewards for an out-of-distribution environment where
no expert data is available, enabling cross-embodiment generalization for
tabletop manipulation. We see our work as starting point for scalable reward
specification from unlabeled videos that will benefit from the rapid advances
in generative modeling. Source code and datasets are available on the project
website: https://escontrela.me/viper</description><identifier>DOI: 10.48550/arxiv.2305.14343</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Learning</subject><creationdate>2023-05</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2305.14343$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2305.14343$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Escontrela, Alejandro</creatorcontrib><creatorcontrib>Adeniji, Ademi</creatorcontrib><creatorcontrib>Yan, Wilson</creatorcontrib><creatorcontrib>Jain, Ajay</creatorcontrib><creatorcontrib>Peng, Xue Bin</creatorcontrib><creatorcontrib>Goldberg, Ken</creatorcontrib><creatorcontrib>Lee, Youngwoon</creatorcontrib><creatorcontrib>Hafner, Danijar</creatorcontrib><creatorcontrib>Abbeel, Pieter</creatorcontrib><title>Video Prediction Models as Rewards for Reinforcement Learning</title><description>Specifying reward signals that allow agents to learn complex behaviors is a
long-standing challenge in reinforcement learning. A promising approach is to
extract preferences for behaviors from unlabeled videos, which are widely
available on the internet. We present Video Prediction Rewards (VIPER), an
algorithm that leverages pretrained video prediction models as action-free
reward signals for reinforcement learning. Specifically, we first train an
autoregressive transformer on expert videos and then use the video prediction
likelihoods as reward signals for a reinforcement learning agent. VIPER enables
expert-level control without programmatic task rewards across a wide range of
DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction
model allows us to derive rewards for an out-of-distribution environment where
no expert data is available, enabling cross-embodiment generalization for
tabletop manipulation. We see our work as starting point for scalable reward
specification from unlabeled videos that will benefit from the rapid advances
in generative modeling. Source code and datasets are available on the project
website: https://escontrela.me/viper</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><subject>Computer Science - Learning</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81OwzAQhH3hUBUeoCf8Agl21naSAwdUFaiUCoQqrtHG2UWWWgc5VYG3J_05fTOX0XxCLLTKTWWtesD0G455Acrm2oCBmXj8DD0N8j1RH_whDFFuhp52o8RRftAPpn6UPKQphzjR057iQTaEKYb4dStuGHcj3V05F9vn1Xb5mjVvL-vlU5OhKyFDU0Phu05prcFaVzh03DGyV2CYGagzno31rqoVaTdV50sqtC-rWrOCubi_zJ7_t98p7DH9tSeP9uwB_xhZQs4</recordid><startdate>20230523</startdate><enddate>20230523</enddate><creator>Escontrela, Alejandro</creator><creator>Adeniji, Ademi</creator><creator>Yan, Wilson</creator><creator>Jain, Ajay</creator><creator>Peng, Xue Bin</creator><creator>Goldberg, Ken</creator><creator>Lee, Youngwoon</creator><creator>Hafner, Danijar</creator><creator>Abbeel, Pieter</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230523</creationdate><title>Video Prediction Models as Rewards for Reinforcement Learning</title><author>Escontrela, Alejandro ; Adeniji, Ademi ; Yan, Wilson ; Jain, Ajay ; Peng, Xue Bin ; Goldberg, Ken ; Lee, Youngwoon ; Hafner, Danijar ; Abbeel, Pieter</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a673-a4932cbb0111355626a6fbfafc034fff3eb4cf45c6890e16eb46c7e21c7891f03</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><topic>Computer Science - Learning</topic><toplevel>online_resources</toplevel><creatorcontrib>Escontrela, Alejandro</creatorcontrib><creatorcontrib>Adeniji, Ademi</creatorcontrib><creatorcontrib>Yan, Wilson</creatorcontrib><creatorcontrib>Jain, Ajay</creatorcontrib><creatorcontrib>Peng, Xue Bin</creatorcontrib><creatorcontrib>Goldberg, Ken</creatorcontrib><creatorcontrib>Lee, Youngwoon</creatorcontrib><creatorcontrib>Hafner, Danijar</creatorcontrib><creatorcontrib>Abbeel, Pieter</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Escontrela, Alejandro</au><au>Adeniji, Ademi</au><au>Yan, Wilson</au><au>Jain, Ajay</au><au>Peng, Xue Bin</au><au>Goldberg, Ken</au><au>Lee, Youngwoon</au><au>Hafner, Danijar</au><au>Abbeel, Pieter</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Video Prediction Models as Rewards for Reinforcement Learning</atitle><date>2023-05-23</date><risdate>2023</risdate><abstract>Specifying reward signals that allow agents to learn complex behaviors is a
long-standing challenge in reinforcement learning. A promising approach is to
extract preferences for behaviors from unlabeled videos, which are widely
available on the internet. We present Video Prediction Rewards (VIPER), an
algorithm that leverages pretrained video prediction models as action-free
reward signals for reinforcement learning. Specifically, we first train an
autoregressive transformer on expert videos and then use the video prediction
likelihoods as reward signals for a reinforcement learning agent. VIPER enables
expert-level control without programmatic task rewards across a wide range of
DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction
model allows us to derive rewards for an out-of-distribution environment where
no expert data is available, enabling cross-embodiment generalization for
tabletop manipulation. We see our work as starting point for scalable reward
specification from unlabeled videos that will benefit from the rapid advances
in generative modeling. Source code and datasets are available on the project
website: https://escontrela.me/viper</abstract><doi>10.48550/arxiv.2305.14343</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2305.14343 |
language | eng |
recordid | cdi_arxiv_primary_2305_14343 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning |
title | Video Prediction Models as Rewards for Reinforcement Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T09%3A21%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Video%20Prediction%20Models%20as%20Rewards%20for%20Reinforcement%20Learning&rft.au=Escontrela,%20Alejandro&rft.date=2023-05-23&rft_id=info:doi/10.48550/arxiv.2305.14343&rft_dat=%3Carxiv_GOX%3E2305_14343%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |