TY - GEN
T1 - ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Yao, Yuhang
AU - Jin, Han
AU - Shah, Alay Dilipbhai
AU - Han, Shanshan
AU - Hu, Zijian
AU - Stripelis, Dimitris
AU - Ran, Yide
AU - Xu, Zhaozhuo
AU - Avestimehr, Salman
AU - He, Chaoyang
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that with 64 concurrent requests on Mixtral 8x7B, ScaleLLM achieves a 4.3× speedup over vLLM and outperforms state-of-the-art systems with 1.5× higher throughput.
AB - Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, there is no comprehensive framework that provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments reveal that with 64 concurrent requests on Mixtral 8x7B, ScaleLLM achieves a 4.3× speedup over vLLM and outperforms state-of-the-art systems with 1.5× higher throughput.
UR - http://www.scopus.com/inward/record.url?scp=85216746698&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216746698&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85216746698
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
SP - 279
EP - 289
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
A2 - Dernoncourt, Franck
A2 - Preotiuc-Pietro, Daniel
A2 - Shimorina, Anastasia
Y2 - 12 November 2024 through 16 November 2024
ER -