ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
Main Authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Recent advancements in integrating large language models (LLMs) with
application programming interfaces (APIs) have gained significant interest in
both academia and industry. These API-based agents, leveraging the strong
autonomy and planning capabilities of LLMs, can efficiently solve problems
requiring multi-step actions. However, their ability to handle
multi-dimensional difficulty levels, diverse task types, and real-world demands
through APIs remains unknown. In this paper, we introduce
ShortcutsBench, a large-scale benchmark for the comprehensive
evaluation of API-based agents in solving tasks with varying levels of
difficulty, diverse task types, and real-world demands. ShortcutsBench
includes a wealth of real APIs from Apple Inc.'s operating systems, refined
user queries from shortcuts, human-annotated high-quality action sequences from
shortcut developers, and accurate parameter-filling values covering primitive
parameter types, enum parameter types, outputs from previous actions, and
parameters that require requesting necessary information from the system or user.
Our extensive evaluation of agents built with 5 leading open-source (size >=
57B) and 4 closed-source LLMs (e.g., Gemini-1.5-Pro and GPT-3.5) reveals
significant limitations in handling complex queries related to API selection,
parameter filling, and requesting necessary information from systems and users.
These findings highlight the challenges that API-based agents face in
effectively fulfilling real and complex user queries. All datasets, code, and
experimental results will be available at
https://github.com/eachsheep/shortcutsbench. |
---|---|
DOI: | 10.48550/arxiv.2407.00132 |