Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning, wherein agents learn to act by watching unlabeled online videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20 Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
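
The abstract describes a three-stage pipeline: train an inverse dynamics model (IDM) on a small labeled dataset, use it to pseudo-label a large corpus of unlabeled gameplay video, then behaviorally clone a causal policy (the "behavioral prior") on those pseudo-labels. Below is a minimal sketch of that pipeline, using toy MLPs and synthetic tensors in place of the paper's large video models; every name, dimension, and architecture here is an illustrative assumption, not the authors' actual code.

```python
# Toy sketch of the semi-supervised imitation pipeline from the abstract.
# Assumptions: frames are flat feature vectors, actions form a small discrete
# set, and both networks are tiny MLPs. The paper's real models operate on
# raw pixels with a 20 Hz mouse-and-keyboard action space.
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME_DIM, N_ACTIONS = 64, 8  # illustrative stand-ins

class IDM(nn.Module):
    """Inverse dynamics model: sees frames t and t+1, predicts the action
    taken at t. Because it also sees the future frame, its task is easier
    than the policy's, so a small labeled set suffices to train it."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * FRAME_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, N_ACTIONS))

    def forward(self, frame_t, frame_tp1):
        return self.net(torch.cat([frame_t, frame_tp1], dim=-1))

class Policy(nn.Module):
    """Causal behavioral prior: sees only the current frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FRAME_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, N_ACTIONS))

    def forward(self, frame_t):
        return self.net(frame_t)

# Synthetic data: a small labeled set (the paper's contractor data) and a
# much larger unlabeled set standing in for online Minecraft videos.
small_frames  = torch.randn(200, 2, FRAME_DIM)       # (frame_t, frame_t+1) pairs
small_actions = torch.randint(0, N_ACTIONS, (200,))  # human-provided action labels
big_frames    = torch.randn(10_000, 2, FRAME_DIM)    # unlabeled video

# Stage 1: supervised training of the IDM on the small labeled set.
idm = IDM()
opt = torch.optim.Adam(idm.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(idm(small_frames[:, 0], small_frames[:, 1]),
                           small_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: pseudo-label the large unlabeled corpus with the trained IDM.
with torch.no_grad():
    pseudo_actions = idm(big_frames[:, 0], big_frames[:, 1]).argmax(dim=-1)

# Stage 3: behavioral cloning of the causal policy on the pseudo-labels.
policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(policy(big_frames[:, 0]), pseudo_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# `policy` is the zero-shot behavioral prior, which the paper then
# fine-tunes with imitation or reinforcement learning.
```

The asymmetry this sketch illustrates is the paper's central idea: the IDM is non-causal and may look at future frames, so its labeling problem is far simpler than the causal behavioral-cloning problem, which is why a small labeled dataset can unlock a vastly larger unlabeled one.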

Bibliographic Details
Published in: arXiv.org, 2022-06
Main authors: Baker, Bowen; Akkaya, Ilge; Zhokhov, Peter; Huizinga, Joost; Tang, Jie; Ecoffet, Adrien; Houghton, Brandon; Sampedro, Raul; Clune, Jeff
Format: Article
Language: English
Subjects: Computer & video games; Diamond tools; Domains; Human performance; Internet; Inverse dynamics; Keyboards; Learning; Mouse devices; Robotics
Publisher: Cornell University Library, arXiv.org (Ithaca)
EISSN: 2331-8422
Online access: Full text