GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
Format: Article
Language: English
Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and the human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo performs poorly on this annotation task and introduces unacceptable quality issues into its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.
DOI: 10.48550/arxiv.2405.15760