TY - JOUR
T1 - A Tale of Two Efficient Value Iteration Algorithms for Solving Linear MDPs with Large Action Space
AU - Xu, Zhaozhuo
AU - Song, Zhao
AU - Shrivastava, Anshumali
N1 - Publisher Copyright:
Copyright © 2023 by the author(s)
PY - 2023
Y1 - 2023
AB - Markov Decision Processes (MDPs) with large action spaces arise naturally in many applications such as language processing, information retrieval, and recommendation systems. Various approaches solve these MDPs through value iteration (VI). Unfortunately, all VI algorithms require expensive linear scans over the entire action space for value function estimation during each iteration. To this end, we present two provable Least-Squares Value Iteration (LSVI) algorithms with runtime complexity sublinear in the number of actions for linear MDPs. We formulate the value function estimation procedure in VI as an approximate maximum inner product search problem and propose a Locality Sensitive Hashing (LSH) type data structure that solves this problem with sublinear time complexity. Our major contribution is combining the guarantees of approximate maximum inner product search with the regret analysis of reinforcement learning. We prove that, with an appropriate choice of approximation factor, there exists a sweet spot: our proposed Sublinear LSVI algorithms maintain the same regret as the original LSVI algorithms while reducing the runtime complexity to sublinear in the number of actions. To the best of our knowledge, this is the first work combining LSH with reinforcement learning that yields provable improvements. We hope that our novel way of combining data structures with iterative algorithms will open the door to further study of cost reduction in reinforcement learning.
UR - http://www.scopus.com/inward/record.url?scp=85162826963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162826963&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85162826963
VL - 206
SP - 788
EP - 836
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023
Y2 - 25 April 2023 through 27 April 2023
ER -