Experiments with Detecting and Mitigating AI Deception
Format: Article
Language: English
Abstract: How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception: the first is based on the path-specific objectives framework, in which paths in the game that incentivise deception are removed; the second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
DOI: 10.48550/arxiv.2306.14816
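
As a rough illustration of the path-specific-objectives idea in the abstract, the sketch below prunes deception-incentivising edges from a toy game tree before computing the agent's value by backward induction. The tree format and the deceptive-edge labels are illustrative assumptions, not the paper's actual construction.

```python
# Minimal sketch: remove paths that incentivise deception, then optimise
# over the remaining honest paths. Purely illustrative.

def best_value(node):
    """Backward-induction value of `node`, skipping deceptive edges."""
    children = node.get("children", [])
    values = [best_value(child)
              for edge_is_deceptive, child in children
              if not edge_is_deceptive]  # drop deception-incentivising paths
    if not values:                        # leaf node, or all edges pruned
        return node.get("reward", 0.0)
    return max(values)

# Toy game: the deceptive branch pays more, but it is removed from the
# objective, so the agent optimises over the honest branch instead.
game = {"children": [
    (True,  {"reward": 1.0}),   # deceptive path: excluded
    (False, {"reward": 0.6}),   # honest path
]}
print(best_value(game))         # -> 0.6
```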
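Likewise, here is a minimal sketch of the shielding scheme as the abstract describes it: monitor the agent's proposed action and substitute a safe reference policy when the monitor flags deception. The toy state, policies, and monitor below are hypothetical stand-ins, not the paper's actual components.

```python
# Minimal shielding sketch: intercept unsafe actions and fall back to a
# safe reference policy. All names here are illustrative.

def agent_policy(state):
    # Learned policy that may try to overclaim from a weak position.
    return "claim_high_value" if state == "weak_hand" else "claim_true_value"

def safe_policy(state):
    # Safe reference policy: always report truthfully.
    return "claim_true_value"

def monitor_flags(state, action):
    # Hypothetical monitor: treats overclaiming from a weak hand as deceptive.
    return state == "weak_hand" and action == "claim_high_value"

def shielded_step(state):
    action = agent_policy(state)
    if monitor_flags(state, action):
        action = safe_policy(state)  # replace the unsafe policy's action
    return action

print(shielded_step("weak_hand"))    # -> claim_true_value (shield intervened)
print(shielded_step("strong_hand"))  # -> claim_true_value (no intervention)
```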