TY - GEN
T1 - Multi-policy posterior sampling for restless Markov bandits
AU - Alnatheer, Suleman
AU - Man, Hong
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/2/5
Y1 - 2014/2/5
N2 - This paper considers the multi-armed restless bandits problem, where each arm has time-varying rewards generated by an unknown two-state discrete-time Markov process. Each chain is assumed to be irreducible, aperiodic, and non-reactive to agent actions. No optimal solution or constant-factor approximation exists for all instances of the restless bandits problem; in fact, it has been proven intractable even when all parameters are deterministic. A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies using belief (probability) distributions. More precisely, the proposed algorithm compares the mean reward of consistently staying with the best perceived arm to the mean reward of a myopically accessed combination of arms using randomized probability matching, better known as Thompson sampling. Empirical evaluations presented at the end of the paper show improved performance over other existing algorithms in all instances of the problem, except for a small set of instances where the arms are similar and bursty.
AB - This paper considers the multi-armed restless bandits problem, where each arm has time-varying rewards generated by an unknown two-state discrete-time Markov process. Each chain is assumed to be irreducible, aperiodic, and non-reactive to agent actions. No optimal solution or constant-factor approximation exists for all instances of the restless bandits problem; in fact, it has been proven intractable even when all parameters are deterministic. A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies using belief (probability) distributions. More precisely, the proposed algorithm compares the mean reward of consistently staying with the best perceived arm to the mean reward of a myopically accessed combination of arms using randomized probability matching, better known as Thompson sampling. Empirical evaluations presented at the end of the paper show improved performance over other existing algorithms in all instances of the problem, except for a small set of instances where the arms are similar and bursty.
KW - Gilbert-Elliot model
KW - Opportunistic spectrum access
KW - Posterior sampling
KW - Restless Bandits
UR - http://www.scopus.com/inward/record.url?scp=84949928394&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84949928394&partnerID=8YFLogxK
U2 - 10.1109/GlobalSIP.2014.7032327
DO - 10.1109/GlobalSIP.2014.7032327
M3 - Conference contribution
AN - SCOPUS:84949928394
T3 - 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
SP - 1271
EP - 1275
BT - 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
T2 - 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
Y2 - 3 December 2014 through 5 December 2014
ER -