Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
Format: Article
Language: eng
Online access: Order full text
Abstract: Large language models (LLMs) aligned through reinforcement learning from human feedback (RLHF) have quickly become one of the dominant paradigms for building intelligent conversational assistant agents. However, despite their strong performance across many benchmarks, LLM-based agents still lack conversational skills such as disambiguation: when generalized assistants are faced with ambiguity, they often overhedge or implicitly guess users' ground-truth intents rather than asking clarification questions, and under task-specific settings, high-quality conversation samples are often limited, affecting models' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (henceforth ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO) which allows for sample-efficient dialogue policy learning in multi-turn conversation. We demonstrate ACT's efficacy under sample-efficient conditions in three difficult conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for text-to-SQL generation. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard approaches to supervised fine-tuning and DPO.
DOI: 10.48550/arxiv.2406.00222
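The abstract above describes ACT as a quasi-online preference optimization algorithm built on DPO, contrasting dialogue actions such as asking a clarifying question versus answering directly. As a rough illustration of the underlying objective only, the sketch below shows the standard pairwise DPO loss in PyTorch; the function name, tensor names, and beta value are illustrative assumptions and not the paper's implementation.

```python
# Minimal sketch of the pairwise DPO objective that ACT builds on (assumed
# illustration, not the paper's code). In an ACT-style setup, the "chosen" and
# "rejected" responses would contrast dialogue actions, e.g. a clarifying
# question (preferred under ambiguity) vs. a direct guess at the user's intent.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


if __name__ == "__main__":
    # Toy sequence-level log-probabilities for a batch of two preference pairs.
    policy_w = torch.tensor([-12.3, -15.1])  # policy log-prob of preferred action (e.g. clarify)
    policy_l = torch.tensor([-11.9, -14.0])  # policy log-prob of dispreferred action (e.g. guess)
    ref_w = torch.tensor([-12.5, -15.4])     # reference-model log-probs for the same responses
    ref_l = torch.tensor([-11.5, -13.8])
    print(dpo_loss(policy_w, policy_l, ref_w, ref_l))
```

The "quasi-online" aspect described in the abstract would additionally involve sampling responses from the current policy during training rather than relying only on a fixed preference dataset; that loop is omitted here.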