TY - GEN
T1 - Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing
T2 - 2024 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2024
AU - Yu, Hanfei
AU - Wang, Hao
AU - Tiwari, Devesh
AU - Li, Jian
AU - Park, Seung Jong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep reinforcement learning (DRL) has achieved remarkable success in diverse areas, including gaming AI, scientific simulations, and large-scale high-performance computing (HPC) system scheduling. DRL training, which involves a trial-and-error process, demands considerable time and computational resources. To overcome this challenge, distributed DRL algorithms and frameworks have been developed to expedite training by leveraging large-scale resources. However, existing distributed DRL solutions rely on synchronous learning over serverful infrastructures, suffering from low training efficiency and overwhelming training costs. This paper proposes Stellaris, the first framework to introduce a generic asynchronous learning paradigm for distributed DRL training with serverless computing. We devise an importance sampling truncation technique to stabilize DRL training and develop a staleness-aware gradient aggregation method tailored to the dynamic staleness of asynchronous serverless DRL training. Experiments on regular AWS EC2 testbeds and HPC clusters show that Stellaris outperforms state-of-the-art DRL baselines, achieving 2.2× higher rewards (i.e., training quality) and reducing training costs by 41%.
UR - http://www.scopus.com/inward/record.url?scp=85213889656&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213889656&partnerID=8YFLogxK
U2 - 10.1109/SC41406.2024.00045
DO - 10.1109/SC41406.2024.00045
M3 - Conference contribution
AN - SCOPUS:85213889656
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2024: International Conference for High Performance Computing, Networking, Storage and Analysis
Y2 - 17 November 2024 through 22 November 2024
ER -