Project Details
Description
Reinforcement learning (RL) has enjoyed many successes in its application thus far to robots and autonomous systems. However, as a tool to support reliable, long-duration autonomy in the real world, the standard practice of estimating only the expected return of candidate actions offers a limited perspective on likely outcomes when value distributions are heavy-tailed or multi-modal. Recent breakthroughs in distributional reinforcement learning (DRL) offer a potential way forward, but many questions remain unanswered regarding how value distributions can be leveraged to improve the performance, efficiency, and safety of reinforcement learning. Recent algorithms have outperformed the state of the art in model-free deep reinforcement learning by exploiting the value distribution as a descriptive approximation tool for maximizing an agent's expected return. However, there is great potential for value distributions to be leveraged in other ways, such as in the exploration policies used by RL agents to select actions while learning.

To this end, this research project will investigate how distributional reinforcement learning can support safe learning with robots and autonomous systems. First, we will investigate the suitability of second-order stochastic dominance (SSD) as a mechanism for safe robot exploration in the context of distributional reinforcement learning (a standard definition of SSD is sketched below, after the task list). Second, we seek to develop optimization methods that preserve the integrity of the value distributions being approximated and examined, with the aim of guaranteeing that an SSD exploration criterion can be imposed with validity. Third, we will rigorously examine the potential of the above for informing exploration policies that facilitate safe, high-performance robot learning and autonomous navigation.

These three goals will be achieved through the execution of the following tasks:

1. Using the dispersion space as a tool for comparing and analyzing the value distributions associated with robot actions with respect to second-order stochastic dominance;
2. Framing distributional reinforcement learning as a free-energy minimization, achieving the first quantile DRL method to provably converge in the second moment of the value distribution;
3. Evaluating the performance of our resulting Dominant Particle Agent (DPA) algorithm for tabular distributional reinforcement learning against suitable benchmark problems and algorithms;
4. Implementing and characterizing a parametric variant of the DPA algorithm for learning over large state-action spaces, including images and other robot perceptual data;
5. Investigating the potential for safe exploration policies to detect and avoid failure modes when they are evident in estimated value distributions with multi-modal characteristics;
6. Using such exploration policies to design high-performance architectures for safe sim-to-real transfer of decision-making and control policies that support autonomous navigation.
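For reference, the SSD criterion named above has a standard definition in the stochastic-dominance literature; the notation below (candidate returns $Z_A$, $Z_B$ with CDFs $F_A$, $F_B$) is illustrative and not the project's own formulation.

```latex
% Second-order stochastic dominance (standard definition):
% Z_A dominates Z_B in the second order, written Z_A \succeq_{SSD} Z_B, iff
\int_{-\infty}^{x} F_A(t)\,\mathrm{d}t \;\le\; \int_{-\infty}^{x} F_B(t)\,\mathrm{d}t
\quad \text{for all } x \in \mathbb{R},
% which is equivalent to
\mathbb{E}\!\left[u(Z_A)\right] \;\ge\; \mathbb{E}\!\left[u(Z_B)\right]
\quad \text{for every nondecreasing concave utility } u.
% That is, Z_A is weakly preferred by every risk-averse decision maker,
% which is what makes SSD a natural criterion for safe exploration.
```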
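Task 1 involves comparing value distributions under SSD. As a minimal illustrative sketch (not the project's dispersion-space method), the following assumes each distribution is represented by equally weighted quantile particles, as in quantile-based DRL; the function name `ssd_dominates` and the tolerance `tol` are hypothetical. It uses the identity that the integrated CDF equals the expected shortfall, $\int_{-\infty}^{x} F(t)\,\mathrm{d}t = \mathbb{E}[(x - Z)^+]$.

```python
import numpy as np

def ssd_dominates(z_a, z_b, tol=1e-9):
    """Check whether particle set z_a second-order stochastically dominates
    z_b, i.e. whether E[(x - Z_A)+] <= E[(x - Z_B)+] for all thresholds x.
    Both inputs are 1-D arrays of equally weighted return samples."""
    z_a = np.sort(np.asarray(z_a, dtype=float))
    z_b = np.sort(np.asarray(z_b, dtype=float))
    # The expected-shortfall functions are piecewise linear with kinks only
    # at particle locations (and equal slopes beyond them), so comparing
    # them on the union of particle locations suffices.
    grid = np.union1d(z_a, z_b)
    shortfall_a = np.mean(np.maximum(grid[:, None] - z_a[None, :], 0.0), axis=1)
    shortfall_b = np.mean(np.maximum(grid[:, None] - z_b[None, :], 0.0), axis=1)
    return bool(np.all(shortfall_a <= shortfall_b + tol))

# Example: a low-dispersion distribution SSD-dominates a riskier one with
# the same mean, so a risk-averse exploration rule would prefer it.
safe  = np.array([0.9, 1.0, 1.1])   # tight returns
risky = np.array([0.0, 1.0, 2.0])   # same mean, heavier tails
print(ssd_dominates(safe, risky))   # True
print(ssd_dominates(risky, safe))   # False
```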
Status | Active |
---|---|
Effective start/end date | 1/05/20 → … |