Optimality Conditions for Long-Run Average Rewards With Underselectivity and Nonsmooth Features

We study three existing issues associated with optimization of long-run average rewards of time-nonhomogeneous Markov processes in continuous time with continuous state spaces: 1) the underselectivity, i.e., the optimal policies do not depend on their actions in any finite time period; 2) its relate...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on automatic control 2017-09, Vol.62 (9), p.4318-4332
1. Verfasser:	Cao, Xi-Ren
Format:	Artikel
Sprache:	eng
Schlagworte:	Aerospace electronics Bias optimality direct-comparison-based optimization Dynamic programming Generators Ito–Tanaka formula local time Markov processes Optimization relative optimization semismooth functions state comparability theory Transient analysis Viscosity viscosity solution
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	We study three existing issues associated with optimization of long-run average rewards of time-nonhomogeneous Markov processes in continuous time with continuous state spaces: 1) the underselectivity, i.e., the optimal policies do not depend on their actions in any finite time period; 2) its related issue, the bias optimality, i.e., policies that optimize both long-run average and transient total rewards, and 3) the effects of a nonsmooth point of a value function on performance optimization. These issues require considerations of the performance in the entire period with an infinite horizon, and therefore are not easily solvable by dynamic programming, which works backwards in time and takes a local view at a particular time instant. In this paper, we take a different approach called the relative optimization theory, which is based on a direct comparison of the performance measures of any two policies. We derive tight necessary and sufficient optimality conditions that take the underselectivity into consideration; we derive bias optimality conditions for both long-run average and transient rewards; and we show that the effect of a wide class of nonsmooth points, called semismooth points, of a value function on the long-run average performance is zero and can be ignored.
ISSN:	0018-9286 1558-2523
DOI:	10.1109/TAC.2017.2655487