Challenges in Detoxifying Language Models
Main Authors: | , , , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions, highlighting further the nuances involved in careful evaluation of LM toxicity. |
DOI: | 10.48550/arxiv.2109.07445 |
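
The abstract refers to automatic evaluation of LM toxicity on the RealToxicityPrompts dataset. The sketch below is only an illustration of what such an automatic evaluation loop can look like, not the paper's pipeline: it assumes the dataset is mirrored on the Hugging Face Hub as `allenai/real-toxicity-prompts`, uses `gpt2` as a stand-in for the evaluated LM, and scores continuations with the public `unitary/toxic-bert` classifier instead of the Perspective API commonly used in prior work; the field and label names are assumptions taken from the respective dataset and model cards.

```python
# Illustrative sketch of automatic toxicity evaluation (not the paper's pipeline).
# Assumed components: "allenai/real-toxicity-prompts" on the Hugging Face Hub,
# gpt2 as the generator under evaluation, and "unitary/toxic-bert" as a
# stand-in toxicity scorer for the Perspective API.
from datasets import load_dataset
from transformers import pipeline

prompts = load_dataset("allenai/real-toxicity-prompts", split="train[:16]")
generator = pipeline("text-generation", model="gpt2")
scorer = pipeline("text-classification", model="unitary/toxic-bert")

toxicity_scores = []
for row in prompts:
    prompt_text = row["prompt"]["text"]  # assumed field layout of the HF dataset
    output = generator(
        prompt_text, max_new_tokens=20, num_return_sequences=1,
        do_sample=True, pad_token_id=50256,
    )[0]["generated_text"]
    continuation = output[len(prompt_text):]  # score only the generated continuation
    # toxic-bert is multi-label, so request per-label sigmoid probabilities.
    all_labels = scorer(continuation, top_k=None, function_to_apply="sigmoid")
    tox = next(r["score"] for r in all_labels if r["label"] == "toxic")
    toxicity_scores.append(tox)

print(f"mean automatic toxicity over {len(toxicity_scores)} continuations: "
      f"{sum(toxicity_scores) / len(toxicity_scores):.3f}")
print(f"max automatic toxicity: {max(toxicity_scores):.3f}")
```

Swapping in a different scorer (for example the Perspective API) would only change the scoring call; the aggregation over continuations stays the same, and it is exactly these aggregate automatic scores that the paper compares against human ratings.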