Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods

This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks. It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedd...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Tsvilodub, Polina, Wang, Hening, Grosch, Sharon, Franke, Michael
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!