FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Bibliographic Details
Published in:	Transactions of the Association for Computational Linguistics 2022-12, Vol.10, p.1473-1490
Main Authors:	Dziri, Nouha; Kamalloo, Ehsan; Milton, Sivan; Zaiane, Osmar; Yu, Mo; Ponti, Edoardo M.; Reddy, Siva
Format:	Article
Language:	English
Subjects:	Benchmarks; Coherence; Datasets; Dialogue; Hallucinations; Human-computer interaction; Natural language
Online Access:	Full text

Description
Abstract:	The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 12.8 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.
ISSN:	2307-387X
DOI:	10.1162/tacl_a_00529