Vision-Language Models as Success Detectors
creator | Du, Yuqing; Konyushkova, Ksenia; Denil, Misha; Raju, Akhil; Landon, Jessica; Hill, Felix; de Freitas, Nando; Cabi, Serkan
description | Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real-world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real-world success detection and reward modelling.
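To make the SuccessVQA framing concrete: the abstract describes reducing success detection to a binary (yes/no) visual question answering query over a video clip and a task description. The sketch below is a minimal, hypothetical illustration of that idea only; `VisionLanguageModel` and its `generate` method are stand-ins for a Flamingo-style model, not a real API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisionLanguageModel:
    """Hypothetical stand-in for a Flamingo-style VLM; not a real API."""
    name: str

    def generate(self, frames: List[bytes], prompt: str) -> str:
        # A real model would condition on the video frames and the text
        # prompt; this stub returns a fixed answer so the sketch runs.
        return "no"


def success_vqa(model: VisionLanguageModel, frames: List[bytes], task: str) -> bool:
    """Pose success detection as a yes/no VQA problem (the SuccessVQA framing)."""
    prompt = f"Did the agent successfully complete the task: {task}? Answer yes or no."
    answer = model.generate(frames, prompt)
    return answer.strip().lower().startswith("yes")


if __name__ == "__main__":
    model = VisionLanguageModel(name="flamingo-style-vlm")
    clip = [b"frame_0", b"frame_1"]  # placeholder for encoded video frames
    print(success_vqa(model, clip, "put the apple in the bowl"))
```

Casting the reward model as a question answerer is what allows the same pretrained backbone to be reused across the three domains listed in the abstract, with only the question text changing per task.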
format | Article |
creationdate | 2023-03-13
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 (open access, free to read)
fulltext | https://arxiv.org/abs/2303.07280
identifier | DOI: 10.48550/arxiv.2303.07280 |
language | eng |
recordid | cdi_arxiv_primary_2303_07280 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
title | Vision-Language Models as Success Detectors |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T20%3A58%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Vision-Language%20Models%20as%20Success%20Detectors&rft.au=Du,%20Yuqing&rft.date=2023-03-13&rft_id=info:doi/10.48550/arxiv.2303.07280&rft_dat=%3Carxiv_GOX%3E2303_07280%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |