The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Format: Article
Language: English
Online access: Order full text
Abstract: We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
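The two-phase evaluation protocol described in the abstract lends itself to a short sketch. The Python fragment below is a minimal illustration, not the authors' implementation: the judge callables, the majority-vote disqualification rule, and the mean aggregation are assumptions standing in for the paper's automated judge models and prompt templates.

```python
# A minimal sketch (not the FACTS Grounding code) of the two-phase judging
# scheme the abstract describes: a response is first checked for eligibility
# (does it fulfill the user request?) and, only if eligible, scored for
# grounding in the provided document. The final factuality score aggregates
# several judge models to mitigate single-judge evaluation bias.
from statistics import mean
from typing import Callable, List

# Hypothetical judge interface: (request, document, response) -> verdict.
Judge = Callable[[str, str, str], bool]

def factuality_score(
    request: str,
    document: str,
    response: str,
    eligibility_judges: List[Judge],
    grounding_judges: List[Judge],
) -> float:
    # Phase 1: disqualify responses that do not fulfill the user request.
    # Assumption: a response is kept only on a strict majority of judges.
    eligible = [j(request, document, response) for j in eligibility_judges]
    if sum(eligible) <= len(eligible) / 2:
        return 0.0  # disqualified responses contribute a score of zero

    # Phase 2: judge whether the response is fully grounded in the document,
    # averaging verdicts across judge models to reduce evaluation bias.
    grounded = [j(request, document, response) for j in grounding_judges]
    return mean(float(v) for v in grounded)
```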
DOI: 10.48550/arXiv.2501.03200