Boosting Weak-to-Strong Agents in Multiagent Reinforcement Learning via Balanced PPO

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-08, Vol. PP, pp. 1-14
Authors: Huang, Sili, Chen, Hechang, Piao, Haiyin, Sun, Zhixiao, Chang, Yi, Sun, Lichao, Yang, Bo
Format: Article
Language: English
Description
Abstract: Multiagent policy gradients (MAPGs), an essential branch of reinforcement learning (RL), have made great progress in both industry and academia. However, existing methods pay little attention to the inadequate training of individual policies, which limits overall performance. We verify the existence of imbalanced training in multiagent tasks and formally define it as an imbalance between policies (IBP). To address the IBP issue, we propose a dynamic policy balance (DPB) model that balances the learning of each policy by dynamically reweighting the training samples. In addition, current methods improve performance by strengthening the exploration of all policies, which disregards training differences within the team and reduces learning efficiency. To overcome this drawback, we derive a technique named weighted entropy regularization (WER), a team-level exploration scheme with additional incentives for individuals who exceed the team. DPB and WER are evaluated on homogeneous and heterogeneous tasks, effectively alleviating the imbalanced training problem and improving exploration efficiency. Furthermore, the experimental results show that our models outperform state-of-the-art MAPG methods with over a 12.1% performance gain on average.
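
To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of how DPB-style sample reweighting and WER-style per-agent entropy coefficients could be plugged into a clipped PPO actor loss. The helper names (dpb_weights, wer_coeffs, ppo_actor_loss) and the exact weighting formulas are illustrative assumptions, not the authors' published implementation.

```python
# Illustrative sketch only (not the paper's released code):
#   (1) DPB-style balancing: per-agent sample weights, derived here from each agent's
#       recent return relative to the team mean, up-weight under-trained policies;
#   (2) WER-style exploration: a shared entropy coefficient plus an extra bonus for
#       agents whose performance exceeds the team average.
import torch
from torch.distributions import Categorical


def dpb_weights(agent_returns: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical dynamic policy balance: softmax over negative relative returns,
    so agents lagging behind the team mean receive larger sample weights."""
    rel = agent_returns - agent_returns.mean()
    w = torch.softmax(-rel / temperature, dim=0)          # shape: [n_agents]
    return w * agent_returns.numel()                      # keep the mean weight near 1


def wer_coeffs(agent_returns: torch.Tensor, base_coef: float = 0.01,
               bonus: float = 0.01) -> torch.Tensor:
    """Hypothetical weighted entropy regularization: team-level coefficient plus a
    bonus only for agents above the team-average return."""
    above_team = (agent_returns > agent_returns.mean()).float()
    return base_coef + bonus * above_team                 # shape: [n_agents]


def ppo_actor_loss(logits, actions, old_log_probs, advantages,
                   agent_returns, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate, averaged over agents with DPB weights and WER entropy."""
    n_agents = logits.shape[0]
    sample_w = dpb_weights(agent_returns)
    ent_coef = wer_coeffs(agent_returns)

    total = 0.0
    for i in range(n_agents):                             # one policy head per agent
        dist = Categorical(logits=logits[i])              # logits[i]: [batch, n_actions]
        log_probs = dist.log_prob(actions[i])
        ratio = torch.exp(log_probs - old_log_probs[i])
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        surrogate = torch.min(ratio * advantages[i], clipped * advantages[i]).mean()
        entropy = dist.entropy().mean()
        # reweight this agent's surrogate (DPB) and its entropy bonus (WER)
        total = total + sample_w[i] * (-surrogate) - ent_coef[i] * entropy
    return total / n_agents


if __name__ == "__main__":
    n_agents, batch, n_actions = 3, 32, 5
    logits = torch.randn(n_agents, batch, n_actions, requires_grad=True)
    actions = torch.randint(0, n_actions, (n_agents, batch))
    old_log_probs = torch.randn(n_agents, batch)
    advantages = torch.randn(n_agents, batch)
    agent_returns = torch.tensor([1.0, 0.4, 0.7])         # e.g., recent mean returns
    loss = ppo_actor_loss(logits, actions, old_log_probs, advantages, agent_returns)
    loss.backward()
    print(float(loss))
```

The point of the sketch is that both mechanisms act per agent: the sample weights shift learning capacity toward weaker policies, while the entropy term stays team-level but grants an extra exploration incentive to above-average agents.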
ISSN: 2162-237X
2162-2388
DOI: 10.1109/TNNLS.2024.3437366