ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), ba...
Gespeichert in:
Hauptverfasser: | , , , , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Recent large language models (LLMs) advancements sparked a growing research
interest in tool assisted LLMs solving real-world challenges, which calls for
comprehensive evaluation of tool-use capabilities. While previous works focused
on either evaluating over stateless web services (RESTful API), based on a
single turn user prompt, or an off-policy dialog trajectory, ToolSandbox
includes stateful tool execution, implicit state dependencies between tools, a
built-in user simulator supporting on-policy conversational evaluation and a
dynamic evaluation strategy for intermediate and final milestones over an
arbitrary trajectory. We show that open source and proprietary models have a
significant performance gap, and complex tasks like State Dependency,
Canonicalization and Insufficient Information defined in ToolSandbox are
challenging even the most capable SOTA LLMs, providing brand-new insights into
tool-use LLM capabilities. ToolSandbox evaluation framework is released at
https://github.com/apple/ToolSandbox |
---|---|
DOI: | 10.48550/arxiv.2408.04682 |