Multi-policy posterior sampling for restless Markov bandits

Suleman Alnatheer, Hong Man

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper considers the Multi-Armed Restless Bandits problem, where each arm has time-varying rewards generated from an unknown two-state discrete-time Markov process. Each chain is assumed to be irreducible, aperiodic, and non-reactive to agent actions. No optimal solution or constant-factor approximation exists for all instances of the Restless Bandits problem; in fact, the problem has been proven intractable even when all parameters are deterministic. A polynomial-time algorithm is proposed that learns the transition parameters of each arm and selects the perceived optimal policy from a set of predefined policies using beliefs, i.e., probability distributions. More precisely, the proposed algorithm compares the mean reward of consistently staying with the best perceived arm to the mean reward of a myopically accessed combination of arms, using randomized probability matching, better known as Thompson Sampling. Empirical evaluations presented at the end of the paper show improved performance over existing algorithms in all instances of the problem except a small set of instances where the arms are similar and bursty.
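To make the mechanism concrete, below is a minimal Python sketch of the multi-policy posterior-sampling loop described in the abstract, under assumed details: two-state Gilbert-Elliott arms with reward 1 in the good state, Beta posteriors on each arm's transition probabilities, and exactly two candidate policies (stay with the best perceived arm vs. a one-step myopic choice). All variable names, the posterior parameterization, and the policy-value estimates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: K restless arms, each an unobserved two-state
# Gilbert-Elliott chain with unknown transition probabilities
# p01 (bad -> good) and p10 (good -> bad); reward 1 in the good state.
K, horizon = 4, 5000
true_p01 = rng.uniform(0.05, 0.5, size=K)
true_p10 = rng.uniform(0.05, 0.5, size=K)
states = rng.integers(0, 2, size=K)           # hidden chain states

# Beta posteriors over each arm's transition probabilities.
a01, b01 = np.ones(K), np.ones(K)             # posterior for p01
a10, b10 = np.ones(K), np.ones(K)             # posterior for p10

last_obs = np.zeros(K, dtype=int)             # last state seen per arm
last_t = np.full(K, -2)                       # round of that observation

def stationary_good(p01, p10):
    # Long-run probability of the good state for a two-state chain,
    # i.e. the mean reward of playing that arm forever.
    return p01 / (p01 + p10)

total = 0
for t in range(horizon):
    # Thompson step: draw one plausible model from the posterior.
    s01 = rng.beta(a01, b01)
    s10 = rng.beta(a10, b10)

    # Policy 1 ("stay"): value of committing to the single best arm
    # under the sampled model.
    stay_vals = stationary_good(s01, s10)
    stay_arm = int(stay_vals.argmax())

    # Policy 2 ("myopic"): crude one-step belief that each arm is good
    # now, propagated from its last observed state.
    p_good = np.where(last_obs == 1, 1.0 - s10, s01)
    myopic_arm = int(p_good.argmax())

    # Multi-policy selection: play whichever policy looks better
    # under the sampled model.
    arm = stay_arm if stay_vals[stay_arm] >= p_good[myopic_arm] else myopic_arm

    # All chains evolve regardless of the action (restless).
    flips = rng.random(K)
    states = np.where(states == 1,
                      (flips >= true_p10).astype(int),
                      (flips < true_p01).astype(int))
    reward = int(states[arm])
    total += reward

    # Update the posterior only from true one-step transitions,
    # i.e. when the same arm was also observed in the previous round.
    if last_t[arm] == t - 1:
        if last_obs[arm] == 1:
            a10[arm] += 1 - reward; b10[arm] += reward
        else:
            a01[arm] += reward; b01[arm] += 1 - reward
    last_obs[arm], last_t[arm] = reward, t

print(f"average reward over {horizon} rounds: {total / horizon:.3f}")
```

The gap check on posterior updates is one way to keep the Beta counts honest in this sketch: between plays of a neglected arm the chain takes multiple hidden steps, so only consecutive observations of the same arm constitute genuine one-step transitions.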

Original language: English
Title of host publication: 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
Pages: 1271-1275
Number of pages: 5
ISBN (Electronic): 9781479970889
DOIs
State: Published - 5 Feb 2014
Event: 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014 - Atlanta, United States
Duration: 3 Dec 2014 – 5 Dec 2014

Publication series

Name: 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014

Conference

Conference: 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
Country/Territory: United States
City: Atlanta
Period: 3/12/14 – 5/12/14

Keywords

  • Gilbert-Elliot model
  • Opportunistic spectrum access
  • Posterior sampling
  • Restless Bandits
