Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application to agentic, multi-step reasoning in interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short of enabling the autonomous agent capabilities needed for complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap via supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model's zero-shot success rate from 18.6% to 81.7% (a 340% relative increase) after a single day of data collection, and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
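
As a concrete illustration of the preference-learning step the abstract describes, below is a minimal sketch of a trajectory-level, DPO-style loss in Python. It is written under stated assumptions rather than from the authors' code: the function name, the beta value, and the use of summed per-trajectory log-probabilities are all illustrative, and the preference pairs are assumed to come from ranking branches explored during search (e.g., by MCTS value estimates and self-critique scores).

# Sketch of a DPO-style preference loss over agent trajectories.
# Hypothetical names and conventions; not the authors' released implementation.
import torch
import torch.nn.functional as F

def trajectory_dpo_loss(
    policy_logp_chosen: torch.Tensor,    # summed log-probs of the preferred trajectory under the current policy
    policy_logp_rejected: torch.Tensor,  # same sum for the dispreferred trajectory
    ref_logp_chosen: torch.Tensor,       # corresponding sums under the frozen reference policy
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,                   # strength of the implicit KL penalty (assumed value)
) -> torch.Tensor:
    # Implicit rewards: how far the policy has moved from the reference on each trajectory.
    r_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    r_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the margin between preferred and dispreferred trajectories.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

Because the loss needs only log-probabilities of already-collected trajectories under the current and reference policies, it can be applied off-policy to pairs harvested from the search tree, which is what lets the agent learn from unsuccessful branches as well as successful ones.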

Bibliographic Details
Main Authors: Putta, Pranav; Mills, Edmund; Garg, Naman; Motwani, Sumeet; Finn, Chelsea; Garg, Divyansh; Rafailov, Rafael
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning
DOI: 10.48550/arxiv.2408.07199
Source: arXiv.org
Online Access: Order full text