Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
| Field | Value |
|---|---|
| Main authors | , , , , , , |
| Format | Article |
| Language | English |
| Subjects | |
| Online access | Order full text |
Abstract: Recently, there has been considerable attention towards leveraging large
language models (LLMs) to enhance decision-making processes. However, aligning
the natural language text instructions generated by LLMs with the vectorized
operations required for execution presents a significant challenge, often
necessitating task-specific details. To circumvent the need for such
task-specific granularity, inspired by preference-based policy learning
approaches, we investigate the utilization of multimodal LLMs to provide
automated preference feedback solely from image inputs to guide
decision-making. In this study, we train a multimodal LLM, termed CriticGPT,
capable of understanding trajectory videos in robot manipulation tasks, serving
as a critic to offer analysis and preference feedback. Subsequently, we
validate the effectiveness of preference labels generated by CriticGPT from a
reward modeling perspective. Experimental evaluation of the algorithm's
preference accuracy demonstrates that it generalizes effectively to new
tasks. Furthermore, performance on Meta-World tasks reveals that CriticGPT's
reward model efficiently guides policy learning, surpassing rewards based on
state-of-the-art pre-trained representation models.
DOI: 10.48550/arxiv.2402.14245
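
The abstract describes turning CriticGPT's preference feedback into a reward model in the spirit of preference-based policy learning. Below is a minimal sketch of the standard way pairwise preference labels are fitted with a Bradley-Terry-style loss; the module names, shapes, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: learning a reward model from pairwise preference labels
# (e.g., labels produced by a critic such as CriticGPT) using the
# Bradley-Terry loss common in preference-based policy learning.
# Names, shapes, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a per-step observation feature vector to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (..., obs_dim) -> (...,) scalar reward per step
        return self.net(obs).squeeze(-1)


def preference_loss(model: RewardModel,
                    seg_a: torch.Tensor,
                    seg_b: torch.Tensor,
                    label: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over two trajectory segments.

    seg_a, seg_b: (batch, horizon, obs_dim) observation segments.
    label: (batch,), 1.0 if segment A is preferred, 0.0 if B is preferred.
    """
    # Sum per-step rewards to get a return estimate for each segment.
    ret_a = model(seg_a).sum(dim=1)
    ret_b = model(seg_b).sum(dim=1)
    # P(A preferred over B) = sigmoid(ret_a - ret_b), fit by cross-entropy.
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, label)


if __name__ == "__main__":
    # Dimensions are placeholders (e.g., Meta-World state observations).
    obs_dim, horizon, batch = 39, 50, 8
    model = RewardModel(obs_dim)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    # Stand-ins for segment pairs and critic-generated preference labels.
    seg_a = torch.randn(batch, horizon, obs_dim)
    seg_b = torch.randn(batch, horizon, obs_dim)
    label = torch.randint(0, 2, (batch,)).float()

    loss = preference_loss(model, seg_a, seg_b, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

Once trained, such a model's per-step reward can replace the environment reward when optimizing a downstream policy (for example with an off-the-shelf RL algorithm), which is the role the abstract attributes to CriticGPT's reward model on Meta-World tasks.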