TY - GEN
T1 - Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading
T2 - 15th Annual ACM Symposium on Cloud Computing, SoCC 2024
AU - Sui, Yifan
AU - Yu, Hanfei
AU - Hu, Yitao
AU - Li, Jianxun
AU - Wang, Hao
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/11/20
Y1 - 2024/11/20
N2 - Serverless computing has rapidly prospered as a new cloud computing paradigm with agile scalability, pay-as-you-go pricing, and ease-of-use features for Machine Learning (ML) inference tasks. Users package their ML code into lightweight serverless functions and execute them in containers. Unfortunately, a notorious problem, called cold-starts, hinders serverless computing from providing low-latency function executions. To mitigate cold-starts, pre-warming, which keeps containers warm predictively, has been widely adopted by academia and industry. However, pre-warming fails to eliminate the unique latency incurred by loading ML artifacts. We observed that for ML inference functions, loading libraries and models takes significantly more time than container warming. Consequently, pre-warming alone is not enough to mitigate ML inference functions' cold-starts. This paper introduces InstaInfer, an opportunistic preloading technique that achieves instant inference by eliminating the latency of loading ML artifacts, thereby minimizing the time cost of function execution. InstaInfer fully utilizes the memory of warmed containers to preload a function's libraries and model, striking a balance between maximum acceleration and resource wastage. We design InstaInfer to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that InstaInfer reduces loading latency by up to 93% and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions.
AB - Serverless computing has rapidly prospered as a new cloud computing paradigm with agile scalability, pay-as-you-go pricing, and ease-of-use features for Machine Learning (ML) inference tasks. Users package their ML code into lightweight serverless functions and execute them in containers. Unfortunately, a notorious problem, called cold-starts, hinders serverless computing from providing low-latency function executions. To mitigate cold-starts, pre-warming, which keeps containers warm predictively, has been widely adopted by academia and industry. However, pre-warming fails to eliminate the unique latency incurred by loading ML artifacts. We observed that for ML inference functions, loading libraries and models takes significantly more time than container warming. Consequently, pre-warming alone is not enough to mitigate ML inference functions' cold-starts. This paper introduces InstaInfer, an opportunistic preloading technique that achieves instant inference by eliminating the latency of loading ML artifacts, thereby minimizing the time cost of function execution. InstaInfer fully utilizes the memory of warmed containers to preload a function's libraries and model, striking a balance between maximum acceleration and resource wastage. We design InstaInfer to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that InstaInfer reduces loading latency by up to 93% and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions.
KW - Cloud Computing
KW - Cold-Start
KW - Machine Learning
KW - Serverless Computing
UR - http://www.scopus.com/inward/record.url?scp=85215511458&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215511458&partnerID=8YFLogxK
U2 - 10.1145/3698038.3698509
DO - 10.1145/3698038.3698509
M3 - Conference contribution
AN - SCOPUS:85215511458
T3 - SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing
SP - 178
EP - 195
BT - SoCC 2024 - Proceedings of the 2024 ACM Symposium on Cloud Computing
Y2 - 20 November 2024 through 22 November 2024
ER -