Sparse randomized policies for Markov decision processes based on Tsallis divergence regularization

Bibliographic Details
Published in: Knowledge-Based Systems, 2024-09, Vol. 300, p. 112105, Article 112105
Authors: Leleux, Pierre, Lebichot, Bertrand, Guex, Guillaume, Saerens, Marco
Format: Article
Language: English
Online access: Full text
Description
Abstract: This work investigates a somewhat different point of view on Markov decision processes by reinterpreting them as a randomized shortest paths problem on a bipartite graph, thereby establishing bridges with entropy-regularized reinforcement learning. The graph structure contains the set of states as “left” nodes and the set of actions as “right” nodes. In that context, the action-to-state transition probabilities are provided by the environment, whereas the state-to-action probabilities correspond to the (stochastic) policy to be found. The randomized shortest paths formalism (minimizing the expected cost to the goal state subject to a (Shannon or Tsallis) relative entropy regularization) is then readily applied to this bipartite structure, providing a possibly sparse stochastic policy interpolating between a least-cost and a purely random policy. The algorithm computing the policy is closely related to the dual linear programming formulation of Markov decision processes, to which the relative entropy regularization term, multiplied by a scaling factor balancing exploitation and exploration (the temperature), is added. It is derived from well-known techniques of discrete optimal control, relying on the backward computation of costates (Lagrange parameters). In summary, the proposed algorithm allows the design of optimal stochastic – but still sparse – policies, ranging from purely rational to random behavior, depending on the temperature parameter.

Highlights:
• A costate-based algorithm for solving entropy-regularized MDPs is investigated.
• The algorithm is based on a regularized dual linear programming formulation of MDPs.
• Using the Tsallis r-divergence regularization provides sparse policies.
• A temperature parameter controls the level of rationality of the agent.
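As a rough, hypothetical illustration of the regularization effect described in the abstract (and not the paper's costate-based algorithm, which operates on the full bipartite state-to-action structure), the Python sketch below contrasts a Tsallis-type (q = 2) regularized policy, computed with sparsemax, against a Shannon-entropy (softmax) policy for the actions available in a single state; the action costs and temperature values are invented for the example.

```python
# Hypothetical illustration (not the paper's costate-based algorithm): contrast a
# Tsallis-type (q = 2) regularized policy, computed via sparsemax, with the dense
# softmax policy obtained under Shannon entropy regularization, for the actions of
# a single state.  Action costs and temperatures below are made up for the example.
import numpy as np

def sparsemax_policy(costs, temperature):
    """Euclidean projection of -costs/temperature onto the probability simplex."""
    z = -np.asarray(costs, dtype=float) / temperature
    z_sorted = np.sort(z)[::-1]                      # sort scores in decreasing order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = z_sorted + (1.0 - cumsum) / k > 0      # actions kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max          # threshold of the projection
    return np.maximum(z - tau, 0.0)

def softmax_policy(costs, temperature):
    """Boltzmann distribution over -costs (Shannon-entropy-regularized policy)."""
    z = -np.asarray(costs, dtype=float) / temperature
    z -= z.max()                                     # for numerical stability
    p = np.exp(z)
    return p / p.sum()

costs = [1.0, 1.2, 3.0, 5.0]                         # immediate costs of four actions
for T in (0.05, 1.0, 100.0):
    print(T, np.round(sparsemax_policy(costs, T), 3),
             np.round(softmax_policy(costs, T), 3))
# Low T: both concentrate on the least-cost action; high T: both tend to uniform.
# At T = 1 the sparsemax policy is [0.6, 0.4, 0, 0]: dominated actions get exactly
# zero probability, whereas softmax always keeps every action strictly positive.
```

At low temperature both policies concentrate on the least-cost action and at high temperature both approach the uniform distribution, but at intermediate temperatures only the Tsallis/sparsemax policy assigns exactly zero probability to clearly dominated actions, which is the sparsity effect highlighted in the abstract.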
ISSN:0950-7051
DOI:10.1016/j.knosys.2024.112105