Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Format: Article
Language: English
Abstract: Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also take a deep dive into the inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
DOI: 10.48550/arxiv.2403.06009
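
The detectors described in the abstract are, at their core, compact text classifiers that assign harm labels to LLM output and can gate a response before it reaches the user. The following Python sketch is purely illustrative and is not the paper's implementation: it trains a toy harm detector with scikit-learn on a handful of made-up examples and uses it as a guardrail; the training data, model choice, and decision threshold are all placeholder assumptions.

# Illustrative sketch only: a toy "detector" as a compact text classifier
# used as a guardrail on LLM output. The data, model, and threshold below
# are placeholder assumptions, not the paper's actual detectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: 1 = harmful, 0 = benign.
texts = [
    "I will hurt you",
    "You people are worthless",
    "Here is the weather forecast for tomorrow",
    "Thanks, that recipe looks great",
]
labels = [1, 1, 0, 0]

# Compact detector: TF-IDF features + logistic regression.
detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(texts, labels)

def guardrail(llm_output: str, threshold: float = 0.5) -> str:
    """Return the LLM output unchanged if the detector deems it safe,
    otherwise withhold it."""
    p_harm = detector.predict_proba([llm_output])[0][1]
    if p_harm >= threshold:
        return "[response withheld: flagged by harm detector]"
    return llm_output

print(guardrail("Thanks, that recipe looks great"))
print(guardrail("You people are worthless"))

In practice, one such classifier would exist per harm category (toxicity, bias, faithfulness, etc.), and the labels they emit can also be logged for the governance and auditing uses the abstract mentions rather than only blocking responses.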