Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning
Format: | Article |
Language: | English |
Abstract: | Real-world cooperation often requires intensive, simultaneous coordination among agents. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among the cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve tasks with non-monotonic returns, which hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment, either by learning value functions with complete expressiveness or by using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents that are essential for solving tasks with non-monotonic returns. To address these problems, we propose a novel explicit credit assignment method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), which is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first show that the proposed value decomposition can capture the complicated interactions among agents and is feasible to learn in large-scale scenarios. Our method then uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains. |
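As a rough illustration of the greedy marginal contribution idea described in the abstract (not the authors' implementation), the following Python sketch credits an agent with the difference between the group value when it plays its greedy action and the value of the group without it. The `group_value` function is a hypothetical toy payoff standing in for AVGM's learned adaptive value decomposition, and all names are placeholders chosen for this example.

```python
def group_value(joint_action):
    """Toy stand-in for the learned group value (AVGM would evaluate its
    adaptive value decomposition network here). The payoff is non-monotonic:
    choosing action 1 is penalized unless at least three agents pick it."""
    ones = sum(1 for a in joint_action.values() if a == 1)
    return 10.0 if ones >= 3 else -2.0 * ones


def greedy_marginal_contribution(agent, others_actions, action_space):
    """Credit for `agent`: group value with the agent playing its greedy
    action minus the group value of the group without the agent."""
    value_without = group_value(others_actions)
    # Greedy step: pick the agent's action that maximizes the group value.
    greedy_action = max(
        action_space,
        key=lambda a: group_value({**others_actions, agent: a}),
    )
    value_with = group_value({**others_actions, agent: greedy_action})
    return value_with - value_without, greedy_action


if __name__ == "__main__":
    # Two teammates have already committed to the coordinated action 1.
    others = {"agent_1": 1, "agent_2": 1}
    credit, action = greedy_marginal_contribution("agent_0", others, [0, 1])
    print(f"greedy action: {action}, marginal contribution: {credit}")
    # Prints: greedy action: 1, marginal contribution: 14.0
```

Note that the brute-force maximization over the action space above is only for illustration; per the abstract, the paper keeps the computation of the greedy marginal contribution linear in time by extending the module with an action encoder.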
DOI: | 10.48550/arxiv.2302.06872 |