Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods

| Main authors: | , , , |
| --- | --- |
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
| Summary: | This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks. It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedding similarity. In a case study on pragmatic language interpretation, we find that LLM predictions are not robust under variation of method choice, both within a single LLM and across different LLMs. As this variability entails pronounced researcher degrees of freedom in reporting results, knowledge of the variability is crucial to secure robustness of results and research integrity. |
| DOI: | 10.48550/arxiv.2403.00998 |
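
To make the probability-based scoring mentioned in the summary concrete, the sketch below scores each answer option by summed and by length-normalized log-probability under a causal language model. This is a minimal illustration, not the paper's code: the model (`gpt2`), the prompt, and the answer options are placeholder assumptions.

```python
# Minimal sketch (not the authors' code): two probability-based scores
# for multiple-choice answer options under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_prob(prompt: str, option: str) -> tuple[float, float]:
    """Return (summed, length-normalized) log-probability of `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab_size)
    # Log-probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that belong to the answer option
    # (boundary tokenization is handled only approximately here).
    n_option = full_ids.shape[1] - prompt_ids.shape[1]
    option_lp = token_lp[0, -n_option:]
    return option_lp.sum().item(), option_lp.mean().item()

# Illustrative pragmatic-interpretation item, not an item from the paper.
prompt = "Q: Can you pass the salt?\nA: "
options = ["Sure, here you go.", "Yes, I am physically able to."]
for opt in options:
    total, avg = option_log_prob(prompt, opt)
    print(f"{opt!r}: sum log-prob = {total:.2f}, mean log-prob = {avg:.2f}")
```

Because the summed score penalizes longer options while the length-normalized score does not, the two scores can rank the same options differently; this is one concrete way in which the choice of scoring method can change an item-level prediction, in line with the variability the paper reports.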