Status-quo policy gradient in Multi-Agent Reinforcement Learning
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Individual rationality, which involves maximizing expected individual returns, does not always lead to high-utility individual or group outcomes in multi-agent problems. For instance, in multi-agent social dilemmas, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to a low-utility, mutually harmful equilibrium. In contrast, humans evolve useful strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior to the status-quo bias, we present a status-quo loss (SQLoss) and a corresponding policy gradient algorithm that incorporates this bias into an RL agent. We demonstrate that agents trained with SQLoss learn high-utility policies in several social dilemma matrix games (Prisoner's Dilemma, a matrix variant of Stag Hunt, and the Chicken Game). We show that SQLoss outperforms existing state-of-the-art methods at obtaining high-utility policies in non-matrix games with visual input (Coin Game and a visual-input variant of Stag Hunt), using pre-trained cooperation and defection oracles. Finally, we show that SQLoss extends to a 4-agent setting by demonstrating the emergence of cooperative behavior in the well-known Braess' paradox.
DOI: 10.48550/arxiv.2111.11692
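
For readers who want a concrete picture of the kind of objective the abstract describes, the following is a minimal, hypothetical sketch in Python, not the authors' implementation. It combines a simple score-function (REINFORCE-style) update on the iterated Prisoner's Dilemma with an imagined "status-quo" return in which both agents freeze their current joint action for `KAPPA` imagined steps. The constants `KAPPA` and `SQ_WEIGHT`, the tabular softmax policies, and the exact form of the imagined return are assumptions made for illustration and are not taken from the paper's definition of SQLoss.

```python
import numpy as np

# Payoff to the row agent in a one-shot Prisoner's Dilemma:
# rows = my action, cols = opponent action (0 = cooperate, 1 = defect).
# The opponent's payoff is the transpose.
PAYOFF = np.array([[-1.0, -3.0],
                   [ 0.0, -2.0]])

GAMMA = 0.96        # discount factor
KAPPA = 10          # number of imagined status-quo steps (assumed constant)
SQ_WEIGHT = 1.0     # weight on the imagined status-quo return (assumed)
LR = 0.1            # learning rate
STEPS = 20          # environment steps per episode
rng = np.random.default_rng(0)

# Tabular softmax policies: logits indexed by the previous joint action.
# State 0 = episode start, states 1-4 encode the last (a0, a1) pair.
logits = [np.zeros((5, 2)) for _ in range(2)]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def next_state(a0, a1):
    return 1 + 2 * a0 + a1

def imagined_sq_return(my_a, opp_a):
    """Discounted return if both agents froze their current actions for KAPPA steps."""
    r = PAYOFF[my_a, opp_a]
    return sum(GAMMA ** k * r for k in range(KAPPA))

for episode in range(2001):
    grads = [np.zeros_like(tbl) for tbl in logits]
    state = 0
    total = np.zeros(2)
    for t in range(STEPS):
        p0 = softmax(logits[0][state])
        p1 = softmax(logits[1][state])
        a0 = rng.choice(2, p=p0)
        a1 = rng.choice(2, p=p1)
        rewards = np.array([PAYOFF[a0, a1], PAYOFF[a1, a0]])
        total += rewards
        # Credit for each agent: immediate reward plus the weighted return it
        # would imagine from sticking with the current joint action.
        credit = [rewards[0] + SQ_WEIGHT * imagined_sq_return(a0, a1),
                  rewards[1] + SQ_WEIGHT * imagined_sq_return(a1, a0)]
        # Bandit-style score-function gradient: d log softmax / d logits = onehot - p.
        for i, (a, p) in enumerate([(a0, p0), (a1, p1)]):
            grads[i][state] += (np.eye(2)[a] - p) * credit[i] * GAMMA ** t
        state = next_state(a0, a1)
    for i in range(2):
        logits[i] += LR * grads[i] / STEPS
    if episode % 500 == 0:
        print(f"episode {episode}: mean reward per agent per step = {total.mean() / STEPS:+.2f}")
```

The sketch only illustrates the general shape of the idea: crediting each agent as if the current joint action persisted for several imagined steps biases the update toward actions that remain attractive when repeated. The paper's actual loss and results are available at the DOI above.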