Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.
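The abstract names two memory structures that feed the agent's planner: a directed knowledge graph of world knowledge (e.g., crafting prerequisites) and a pool of summarized past experiences for in-context retrieval. As an illustration only, the following minimal Python sketch shows how such a hybrid memory might be organized; it is not the authors' implementation, and every class and method name here is hypothetical.

    # Illustrative sketch only: a mock-up of the two memory structures the
    # abstract names. All names are hypothetical, not the paper's API.
    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeGraph:
        """Directed knowledge graph: edges point from prerequisites to goals."""
        edges: dict[str, list[str]] = field(default_factory=dict)

        def add(self, prerequisite: str, goal: str) -> None:
            self.edges.setdefault(goal, []).append(prerequisite)

        def plan(self, goal: str) -> list[str]:
            # Depth-first expansion of prerequisites into an ordered sub-goal list.
            steps: list[str] = []
            for pre in self.edges.get(goal, []):
                steps += self.plan(pre)
            steps.append(goal)
            return steps

    @dataclass
    class ExperiencePool:
        """Stores summarized (task, outcome) records for in-context retrieval."""
        records: list[tuple[str, str]] = field(default_factory=list)

        def add(self, task: str, summary: str) -> None:
            self.records.append((task, summary))

        def retrieve(self, task: str) -> list[str]:
            return [s for t, s in self.records if task in t]

    if __name__ == "__main__":
        kg = KnowledgeGraph()
        kg.add("log", "planks")
        kg.add("planks", "crafting_table")
        print(kg.plan("crafting_table"))  # ['log', 'planks', 'crafting_table']

In this toy version, the planner's sub-goal ordering falls out of a depth-first walk over prerequisite edges, and the experience pool is a simple keyword lookup; the paper's actual components (Knowledge-guided Planner, Experience-Driven Reflector) operate on multimodal data and an MLLM backbone.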

Bibliographic Details
Main Authors: Li, Zaijing; Xie, Yuquan; Shao, Rui; Chen, Gongwei; Jiang, Dongmei; Nie, Liqiang
Format: Article
Language: English
Published: 2024-08-07 (arXiv)
DOI: 10.48550/arxiv.2408.03615
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online Access: https://arxiv.org/abs/2408.03615