Multiagent reinforcement learning for strictly constrained tasks based on Reward Recorder



Bibliographic details
Published in: International Journal of Intelligent Systems 2022-11, Vol. 37 (11), p. 8387-8411
Main authors: Ding, Lifu; Yan, Gangfeng; Liu, Jianing
Format: Article
Language: English
Online access: Full text
Description
Abstract: Multiagent reinforcement learning (MARL) has been widely applied to engineering problems. However, many strictly constrained problems, such as distributed optimization in engineering applications, remain a great challenge for MARL. In particular, strict global constraints on agents' actions easily lead to sparse rewards. Moreover, existing studies cannot resolve the instability caused by partial observability while keeping the algorithm fully distributed, and algorithms that rely on centralized training may encounter significant obstacles in real-world deployment. For the first time, we provide a theoretical analysis for MARL that determines the adverse effects of partial observability on convergence, and we propose a fully distributed, convergent MARL algorithm based on a Reward Recorder. Each agent runs an independent reinforcement learning algorithm and uses an average-consensus protocol to estimate the global state-action value locally, thereby achieving global optimization. To verify the performance of the algorithm, we propose a novel generalized constrained optimization model that includes local inequality constraints and strict global constraints. The proposed distributed reinforcement learning algorithm is validated on several simulation examples. The results reveal that the proposed algorithm has high stability and excellent decision-making ability.
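The abstract's core mechanism is an average-consensus protocol through which each agent locally estimates a global quantity from neighbor communication only. The following is a minimal illustrative sketch of that general idea, not the paper's algorithm: the ring topology, step size, and the interpretation of the values as local estimates are all assumptions made here for demonstration.

```python
import numpy as np

def average_consensus(values, neighbors, steps=200, eps=0.3):
    """Discrete-time average consensus: x_i <- x_i + eps * sum_j (x_j - x_i).

    Each agent i repeatedly nudges its local estimate toward its
    neighbors' estimates; all estimates converge to the global mean.
    """
    x = np.array(values, dtype=float)
    for _ in range(steps):
        x_new = x.copy()
        for i, nbrs in enumerate(neighbors):
            x_new[i] += eps * sum(x[j] - x[i] for j in nbrs)
        x = x_new
    return x

# Hypothetical example: 4 agents on a ring, each holding one local value
# (standing in for a locally computed state-action estimate).
values = [1.0, 3.0, 5.0, 7.0]
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]  # neighbor lists for a 4-cycle
est = average_consensus(values, ring)
# All agents converge to the global average 4.0
```

With step size eps below 1/deg_max (here deg_max = 2), the iteration is a standard convergent consensus update; in a fully distributed MARL setting, a scheme of this kind lets every agent act on a shared estimate without any centralized training component.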
ISSN: 0884-8173, 1098-111X
DOI: 10.1002/int.22945