MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
Main Authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | To solve complex tasks, large language models (LLMs) often require multiple
rounds of interactions with the user, sometimes assisted by external tools.
However, current evaluation protocols often emphasize benchmark performance
with single-turn exchanges, neglecting the nuanced interactions among the user,
LLMs, and external tools, while also underestimating the importance of natural
language feedback from users. These oversights contribute to discrepancies
between research benchmark evaluations and real-world use cases. We introduce
MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn
interactions by (1) using tools and (2) leveraging natural language feedback.
To ensure reproducibility, we provide an evaluation framework where LLMs can
access tools by executing Python code and receive users' natural language
feedback simulated by GPT-4. We repurpose a diverse set of established
evaluation datasets focusing on reasoning, coding, and decision-making and
carefully curate them into a compact subset for efficient evaluation. Our
analysis of 20 open- and closed-source LLMs offers intriguing findings. (a)
LLMs generally benefit from tools and language feedback, with performance gains
(absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural
language feedback. (b) Better single-turn performance does not guarantee better
multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised
instruction-finetuning (SIFT) and reinforcement learning from human feedback
(RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure
progress and incentivize research in improving LLMs' capabilities in multi-turn
interactions, especially for open-source communities where multi-turn human
evaluation can be less accessible compared to commercial LLMs with a larger
user base. |
---|---|
DOI: | 10.48550/arxiv.2309.10691 |
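
The summary describes an interaction loop in which a model may execute Python code as a tool and receives natural language feedback between turns. The sketch below illustrates that general shape only; it is not MINT's actual framework. The names `run_python`, `evaluate_task`, `toy_model`, and `toy_feedback` are hypothetical, the model and the feedback provider are hard-coded stubs (the summary says feedback is simulated by GPT-4), the turn budget of 5 is an arbitrary assumption, and `exec` is used without any sandboxing.

```python
import contextlib
import io
from typing import Callable, List, Tuple

# Hypothetical types: a model maps the history to ("code" | "answer", content);
# a feedback provider maps the history to a natural language comment.
Model = Callable[[List[str]], Tuple[str, str]]
Feedback = Callable[[List[str]], str]


def run_python(code: str) -> str:
    """Run a code string and return its printed output, or the error raised."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # assumption: trusted code; no sandbox here
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def evaluate_task(model: Model, feedback: Feedback, reference: str,
                  max_turns: int = 5) -> Tuple[bool, int]:
    """Return (solved, turns_used) for one task under a fixed turn budget."""
    history: List[str] = []
    for turn in range(1, max_turns + 1):
        action, content = model(history)
        if action == "answer":
            if content == reference:
                return True, turn
            history.append(f"answer: {content}")
        else:
            # Tool use: execute the proposed code and record the observation.
            history.append(f"code: {content}")
            history.append(f"observation: {run_python(content)}")
        # Language feedback is appended after every unsuccessful turn.
        history.append(f"feedback: {feedback(history)}")
    return False, max_turns


def toy_model(history: List[str]) -> Tuple[str, str]:
    """Stub model: compute with the tool first, then answer with the result."""
    prefix = "observation: "
    observations = [h for h in history if h.startswith(prefix)]
    if not observations:
        return "code", "print(6 * 7)"
    return "answer", observations[-1][len(prefix):]


def toy_feedback(history: List[str]) -> str:
    """Stub standing in for simulated user feedback."""
    return "The computation looks reasonable; give your final answer."


if __name__ == "__main__":
    print(evaluate_task(toy_model, toy_feedback, reference="42"))
```

With these stubs the task is solved on the second turn: the first turn spends the tool call, and the second submits the observed value as the answer. A real evaluation would replace both stubs with model calls and aggregate the success rate over many tasks at each turn budget.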