$\beta$-calibration of Language Model Confidence Scores for Generative QA
| Main authors: | |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
| Abstract: | To use generative question-and-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is on average indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce $\beta$-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized posthoc calibration schemes for achieving $\beta$-calibration. |
| DOI: | 10.48550/arxiv.2410.06615 |
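
For context on the abstract above, the standard (average-case) calibration condition it refers to is commonly written as the first display below. The second display is only an illustrative sketch of what "calibration holds across different question-and-answer groups" could mean as a group-conditional strengthening; the symbols $c(X)$, $Y$, and $G(X)$ are assumed notation for this sketch, and it is not the paper's exact definition of $\beta$-calibration.

```latex
% Standard (average-case) calibration of a confidence score c(X) for the
% correctness indicator Y: among answers assigned confidence level t,
% a fraction t should be correct.
\[
  \mathbb{E}\bigl[\, Y \mid c(X) = t \,\bigr] = t
  \qquad \text{for all } t \in [0,1].
\]

% Illustrative group-conditional strengthening (assumed reading, not the
% paper's exact definition of beta-calibration): the same condition holds
% within every question-and-answer group g assigned by a grouping map G.
\[
  \mathbb{E}\bigl[\, Y \mid c(X) = t,\; G(X) = g \,\bigr] = t
  \qquad \text{for all } t \in [0,1] \text{ and all groups } g.
\]
```

A discretized posthoc scheme of the kind mentioned in the abstract would then, roughly, bin confidence scores and adjust them so that the empirical accuracy within each bin (and, for the group-conditional form, within each bin of each group) matches the bin's confidence level.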