TY - GEN
T1 - TensorOpera Router
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Stripelis, Dimitris
AU - Hu, Zijian
AU - Zhang, Jipeng
AU - Xu, Zhaozhuo
AU - Shah, Alay Dilipbhai
AU - Jin, Han
AU - Yao, Yuhang
AU - Zhang, Tong
AU - Avestimehr, Salman
AU - He, Chaoyang
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the best-performing expert based on the query’s requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.
AB - With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the best-performing expert based on the query’s requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.
UR - http://www.scopus.com/inward/record.url?scp=85216763709&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216763709&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85216763709
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
SP - 452
EP - 462
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
A2 - Dernoncourt, Franck
A2 - Preotiuc-Pietro, Daniel
A2 - Shimorina, Anastasia
Y2 - 12 November 2024 through 16 November 2024
ER -