Residual Physics and Post-Posed Shielding for Safe Deep Reinforcement Learning Method

Bibliographic Details
Published in: IEEE Transactions on Cybernetics, 2024-02, Vol. 54 (2), pp. 865-876
Main authors: Zhang, Qingang; Mahbod, Muhammad Haiqal Bin; Chng, Chin-Boon; Lee, Poh-Seng; Chui, Chee-Kong
Format: Article
Language: English
Description
Abstract: Deep reinforcement learning (DRL) has been studied for computer room air conditioning unit control problems in data centers (DCs). However, two main issues limit the deployment of DRL in actual systems. First, a large amount of data is needed. Second, as a mission-critical system, safe control must be guaranteed, and temperatures in DCs must be kept within a certain operating range. To mitigate these issues, this article proposes a novel control method, RP-SDRL. First, Residual Physics, built using the first law of thermodynamics, is integrated with the DRL algorithm and a Prediction Model. Subsequently, a Correction Model adapted from gradient descent is combined with the Prediction Model as Post-Posed Shielding to enforce safe actions. The RP-SDRL method was validated in simulation, with noise added to the model states to further test its performance under state uncertainty. Experimental results show that combining Residual Physics with DRL can significantly improve the initial policy, sample efficiency, and robustness. Residual Physics also improves the sample efficiency and accuracy of the Prediction Model. While DRL alone cannot avoid constraint violations, RP-SDRL can detect unsafe actions and significantly reduce violations. Compared to the baseline controller, about 13% of electricity usage can be saved.
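The post-posed shielding idea described in the abstract can be sketched as follows: a prediction model forecasts the next temperature from the current state and proposed action, and if the forecast violates the safe range, the action is iteratively corrected by gradient descent on the violation until the prediction is safe. This is a minimal illustrative sketch; the constraint limit, the linear toy prediction model, and all function names are assumptions for demonstration, not the authors' implementation.

```python
# Minimal sketch of Post-Posed Shielding via gradient-descent correction.
# Assumptions (illustrative only): a toy linear, differentiable prediction
# model and an upper temperature limit T_MAX; the paper's actual models
# differ.

T_MAX = 27.0  # assumed upper safe temperature limit (deg C)

def predict_temp(state: float, action: float) -> float:
    """Toy prediction model: more cooling (larger action) -> lower temperature."""
    return state - 2.0 * action  # stand-in for a learned prediction model

def shield(state: float, action: float, lr: float = 0.1, max_iters: int = 50) -> float:
    """Correct `action` until the predicted temperature satisfies the constraint."""
    a = float(action)
    for _ in range(max_iters):
        t_pred = predict_temp(state, a)
        if t_pred <= T_MAX:
            break  # proposed action is already safe; pass it through unchanged
        # d(t_pred)/d(action) = -2.0 for the toy model above
        grad = -2.0
        # descend on the violation (t_pred - T_MAX) with respect to the action
        a -= lr * (t_pred - T_MAX) * grad
    return a

safe_a = shield(state=30.0, action=0.5)
print(predict_temp(30.0, safe_a) <= T_MAX + 1e-6)  # True: corrected action is safe
```

Because the shield only activates when the forecast violates the constraint, the DRL policy's action is left untouched in the safe region, which matches the "post-posed" (after-the-policy) placement described in the abstract.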
ISSN:2168-2267
2168-2275
DOI:10.1109/TCYB.2022.3178084