TensorOpera Router: A Multi-Model Router for Efficient LLM Inference

Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Tong Zhang, Salman Avestimehr, Chaoyang He

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that when compared to standalone expert models, TO-Router improves query efficiency by up to 40%, and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
Editors: Franck Dernoncourt, Daniel Preotiuc-Pietro, Anastasia Shimorina
Pages: 452-462
Number of pages: 11
ISBN (Electronic): 9798891761667
State: Published - 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 12/11/24 to 16/11/24
