ReMIX: Regret Minimization for Monotonic Value Function Factorization in Multiagent Reinforcement Learning
Format: Article
Language: English
Abstract: Value function factorization methods have become a dominant approach for
cooperative multiagent reinforcement learning under the centralized training with
decentralized execution paradigm. By factorizing the optimal joint action-value
function with a monotonic mixing function of the agents' utilities, these
algorithms guarantee consistency between joint and local action selections for
decentralized decision-making. However, monotonic mixing functions also impose
representational limitations, and finding the optimal projection of an
unrestricted mixing function onto the monotonic function class remains an open
problem. To this end, we propose ReMIX, which formulates this optimal projection
problem for value function factorization as a regret minimization over the
projection weights of different state-action values. The relaxed optimization
problem can be solved with the Lagrangian multiplier method, yielding closed-form
optimal projection weights. By minimizing the resulting policy regret, we narrow
the gap between the optimal and the restricted monotonic mixing functions, thus
obtaining an improved monotonic value function factorization. Experimental
results on the Predator-Prey and StarCraft Multi-Agent Challenge environments
demonstrate the effectiveness of our method and its improved ability to handle
environments with non-monotonic value functions.
DOI: 10.48550/arxiv.2302.05593
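The abstract's central idea, that the quality of a restricted (monotonic) factorization depends on *which* state-action values the projection is weighted toward, can be illustrated with a toy example. The sketch below is not the authors' ReMIX algorithm and does not use their closed-form regret-minimizing weights; it simply projects an unrestricted joint Q-matrix onto an additive (hence monotonic) decomposition under two hypothetical weightings, one uniform and one that upweights the optimal joint action, to show that the weighting changes which greedy joint action the factorized values recover. All names and numbers are illustrative.

```python
import numpy as np

# Toy 2-agent matrix game with a non-monotonic payoff: the optimal joint
# action (0, 0) is surrounded by heavy penalties, the classic failure case
# for monotonic factorizations fit with uniform weights.
Q_star = np.array([
    [  8., -12., -12.],
    [-12.,   0.,   0.],
    [-12.,   0.,   0.],
])

def project_additive(Q, w):
    """Weighted least-squares projection of Q onto q1[a1] + q2[a2].

    Minimizes sum_{a1,a2} w[a1,a2] * (q1[a1] + q2[a2] - Q[a1,a2])**2
    by solving the normal equations for the per-agent utilities.
    """
    n = Q.shape[0]
    A = np.zeros((2 * n, 2 * n))  # unknowns: q1 (first n), q2 (last n)
    b = np.zeros(2 * n)
    for a1 in range(n):
        for a2 in range(n):
            idx = (a1, n + a2)
            for i in idx:
                for j in idx:
                    A[i, j] += w[a1, a2]
                b[i] += w[a1, a2] * Q[a1, a2]
    # The additive class has a constant-shift null space, so use lstsq.
    q = np.linalg.lstsq(A, b, rcond=None)[0]
    q1, q2 = q[:n], q[n:]
    return q1[:, None] + q2[None, :]

def greedy(Q):
    """Greedy joint action under the (factorized) value matrix."""
    return np.unravel_index(np.argmax(Q), Q.shape)

# Hypothetical weightings: uniform vs. upweighting the optimal joint action.
uniform = np.ones_like(Q_star)
w_opt = np.where(Q_star >= Q_star.max() - 1e-9, 1.0, 0.1)

Q_uni = project_additive(Q_star, uniform)  # misses (0, 0)
Q_opt = project_additive(Q_star, w_opt)    # recovers (0, 0)
print("uniform-weight greedy action:", greedy(Q_uni))
print("optimal-weighted greedy action:", greedy(Q_opt))
```

Under uniform weights the additive fit averages the penalties into agent 1's and agent 2's first actions, so the greedy joint action lands in the safe lower-right block; upweighting the optimal entry makes the projection represent it faithfully, which is the gap a regret-minimizing choice of projection weights is meant to close.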