Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without...
Gespeichert in:
Hauptverfasser: | , , , , , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Robust content moderation classifiers are essential for the safety of
Generative AI systems. In this task, differences between safe and unsafe inputs
are often extremely subtle, making it difficult for classifiers (and indeed,
even humans) to properly distinguish violating vs. benign samples without
context or explanation. Scaling risk discovery and mitigation through
continuous model fine-tuning is also slow, challenging and costly, preventing
developers from being able to respond quickly and effectively to emergent
harms. We propose a Classification approach employing Retrieval-Augmented
Generation (Class-RAG). Class-RAG extends the capability of its base LLM
through access to a retrieval library which can be dynamically updated to
enable semantic hotfixing for immediate, flexible risk mitigation. Compared to
model fine-tuning, Class-RAG demonstrates flexibility and transparency in
decision-making, outperforms on classification and is more robust against
adversarial attack, as evidenced by empirical studies. Our findings also
suggest that Class-RAG performance scales with retrieval library size,
indicating that increasing the library size is a viable and low-cost approach
to improve content moderation. |
---|---|
DOI: | 10.48550/arxiv.2410.14881 |