Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton

Purpose - The two-armed Bernoulli bandit (TABB) problem is a classical optimization problem where an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting either in a reward or a penalty. The reward probabilities of each arm are unknown, and thus one must...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:International journal of intelligent computing and cybernetics 2010-01, Vol.3 (2), p.207-234
1. Verfasser: Granmo, Ole-Christoffer
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Purpose - The two-armed Bernoulli bandit (TABB) problem is a classical optimization problem where an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting either in a reward or a penalty. The reward probabilities of each arm are unknown, and thus one must balance between exploiting existing knowledge about the arms, and obtaining new information. The purpose of this paper is to report research into a completely new family of solution schemes for the TABB problem: the Bayesian learning automaton (BLA) family.Design methodology approach - Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. BLA avoids the problem of computational intractability by not explicitly performing the Bayesian computations. Rather, it is based upon merely counting rewards penalties, combined with random sampling from a pair of twin Beta distributions. This is intuitively appealing since the Bayesian conjugate prior for a binomial parameter is the Beta distribution.Findings - BLA is to be proven instantaneously self-correcting, and it converges to only pulling the optimal arm with probability as close to unity as desired. Extensive experiments demonstrate that the BLA does not rely on external learning speed accuracy control. It also outperforms established non-Bayesian top performers for the TABB problem. Finally, the BLA provides superior performance in a distributed application, namely, the Goore game (GG).Originality value - The value of this paper is threefold. First of all, the reported BLA takes advantage of the Bayesian perspective for tackling TABBs, yet avoids the computational complexity inherent in Bayesian approaches. Second, the improved performance offered by the BLA opens up for increased accuracy in a number of TABB-related applications, such as the GG. Third, the reported results form the basis for a new avenue of research - even for cases when the reward penalty distribution is not Bernoulli distributed. Indeed, the paper advocates the use of a Bayesian methodology, used in conjunction with the corresponding appropriate conjugate prior.
ISSN:1756-378X
1756-3798
DOI:10.1108/17563781011049179