Vision-Language Models as Success Detectors

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real world success detection and reward modelling.
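The abstract above casts success detection as a binary visual question answering problem. As a rough illustration only (not the paper's actual implementation: `Episode` and the `vqa_model` callable below are hypothetical stand-ins for Flamingo's real interface), the SuccessVQA framing can be sketched as:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """A rollout to evaluate: video frames plus the language-specified task."""
    frames: List[object]       # video frames from the agent's behaviour
    task_description: str      # e.g. "put the apple in the bowl"


def success_vqa(episode: Episode,
                vqa_model: Callable[[List[object], str], str]) -> bool:
    """Ask a vision-language model a yes/no question about task success.

    `vqa_model` is an assumed interface: it takes (frames, question) and
    returns a free-form text answer, which we parse into a binary reward.
    """
    question = f"Did the agent successfully {episode.task_description}?"
    answer = vqa_model(episode.frames, question)
    return answer.strip().lower().startswith("yes")
```

The design choice this illustrates is that the success detector needs no task-specific head: changing the task only changes the question text, which is what lets a single pretrained model generalise across language variations.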

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Du, Yuqing, Konyushkova, Ksenia, Denil, Misha, Raju, Akhil, Landon, Jessica, Hill, Felix, de Freitas, Nando, Cabi, Serkan
Format: Article
Language: eng
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: Order full text
creator Du, Yuqing
Konyushkova, Ksenia
Denil, Misha
Raju, Akhil
Landon, Jessica
Hill, Felix
de Freitas, Nando
Cabi, Serkan
description Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real world success detection and reward modelling.
doi_str_mv 10.48550/arxiv.2303.07280
format Article
creationdate 2023-03-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2303.07280
language eng
recordid cdi_arxiv_primary_2303_07280
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning
title Vision-Language Models as Success Detectors
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T20%3A58%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Vision-Language%20Models%20as%20Success%20Detectors&rft.au=Du,%20Yuqing&rft.date=2023-03-13&rft_id=info:doi/10.48550/arxiv.2303.07280&rft_dat=%3Carxiv_GOX%3E2303_07280%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true