The world of Large Language Models (LLMs) presents a classic trade-off: performance versus cost. On one hand, you have state-of-the-art models such as GPT-4 that excel at complex reasoning, sophisticated coding, and nuanced conversation. On the other, there are smaller, open-source models that are dramatically cheaper to run but often falter on demanding tasks.
For real-world applications—from customer service chatbots to data analysis tools—the challenge is the same:
How do you achieve the best possible performance without breaking the budget?
Motivation: Smart Model Use in Practice
Imagine a customer service bot. For a simple query such as:
“What are your business hours?”
a small, fast, inexpensive model is perfect. But for a complex request like:
“Compare your two flagship smartphones based on battery life, low-light camera performance, and multitasking capabilities.”
you need the big guns—a large, high-end model with stronger reasoning capacity. Running a heavyweight model on every query is wasteful and expensive, while using a lightweight model for everything risks frustrating your users.
This is the essence of the LLM routing problem: dynamically selecting the most appropriate model for each incoming query.
Limitations of Traditional Solutions
Traditional approaches frame the problem as supervised learning:
- Build a massive “oracle” dataset by running every query through every available LLM.
- Label the dataset with the best-performing model for each query.
- Train a routing model on this full-information dataset.
This approach works in a controlled setting but suffers from two major flaws:
1. Astronomical data collection costs. Every query must be answered by every model to determine the best one, which is time-consuming and expensive in terms of both computation and LLM inference costs.
2. Lack of adaptability. Real-world queries evolve over time: new topics emerge, and static routers trained on old data fail to adapt.
The New Paradigm: Routing as a Bandit Problem
A recent paper, Adaptive LLM Routing under Budget Constraints, proposes a paradigm shift.
Instead of requiring costly full supervision, the authors treat routing as a contextual bandit problem, drawing inspiration from recommendation systems:
A recommender doesn’t show you every possible item—it picks one, observes your reaction (e.g., a click or not), and learns from that limited feedback.
Similarly, in LLM routing:
- You choose one model for a query.
- You get one piece of evaluative feedback (thumbs up/down, rating score).
- You learn to make better choices over time—without knowing how all other models would have performed.
Introducing PILOT
The authors propose PILOT (Preference-prior Informed LinUCB fOr Adaptive RouTing), a specialized contextual bandit algorithm to learn intelligent routing under budget constraints.
PILOT rests on three pillars:
A Shared Embedding Space — Represent both queries and LLMs in the same vector space so their cosine similarity reflects how well they match.
Pretraining with Human Preferences — Use public datasets where humans have compared model responses to give the router a warm-start with sensible priors.
Continuous Online Learning with a Budget Policy — Refine the embeddings as live feedback streams in and integrate a budget-aware selection mechanism.
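Before unpacking each pillar, a toy sketch of the first one may help. The model names and 64-dimensional random vectors below are placeholders; in PILOT the query vector would come from the learned projection described next, and the LLM vectors from pretraining.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
query_vec = rng.normal(size=64)                # psi(q): projected query embedding
llm_vecs = {name: rng.normal(size=64)          # theta_a: one embedding per LLM
            for name in ["gpt-4", "claude-v1", "mixtral"]}

# Route the query to the LLM whose embedding is most similar to it.
chosen = max(llm_vecs, key=lambda m: cosine(query_vec, llm_vecs[m]))
```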
Step 1: Pretraining with Human Preferences
Starting from scratch is inefficient. PILOT uses preference datasets like ChatArena, whose records are tuples of the form (query, response from model A, response from model B, human preference label).
The pretraining has two stability-focused phases:
Phase 1 — Learn Query Projections
Take existing query embeddings (e.g., from text-embedding-3-small) and learn a linear transformation into the shared space.
Use contrastive triplet loss to push embeddings of queries that share a preferred model closer together and push others apart.
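A minimal PyTorch sketch of what one such training step could look like; the dimensions, margin, and function names are illustrative assumptions, not the paper's exact configuration. Triplets are mined from the preference data: two queries whose human comparisons favor the same LLM form an anchor-positive pair, and a query preferring a different LLM serves as the negative.

```python
import torch
import torch.nn as nn

# Linear projection from the base embedding space into the shared space.
dim_in, dim_out = 1536, 128          # e.g., text-embedding-3-small -> shared space
projection = nn.Linear(dim_in, dim_out, bias=False)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

def phase1_step(anchor, positive, negative):
    """One contrastive step: anchor and positive share a preferred LLM,
    the negative does not. All inputs are base query embeddings."""
    optimizer.zero_grad()
    loss = triplet_loss(projection(anchor), projection(positive), projection(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```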
Phase 2 — Learn LLM Embeddings
Freeze the query projection. Then, place embeddings for each LLM in the shared space so that for a given query, its preferred LLM lies close to the query embedding.
This is framed as a binary classification problem.
Figure 1: The two-phase pretraining process. (1) Contrastive learning aligns query embeddings based on shared LLM preferences. (2) The LLM embeddings themselves are learned and positioned within this shared space.
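A companion sketch for Phase 2, under the assumption that the binary label is "this LLM's response was preferred for this query" and that the cosine score feeds a logistic loss; the `scale` temperature is a hypothetical knob, not a parameter named in the paper.

```python
import torch
import torch.nn.functional as F

num_llms, dim = 8, 128
theta = torch.nn.Parameter(torch.randn(num_llms, dim))   # learnable LLM embeddings
bce = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam([theta], lr=1e-3)

def phase2_step(projected_queries, llm_ids, preferred, scale=5.0):
    """projected_queries: frozen psi(q) vectors; llm_ids: which LLM each row is
    paired with; preferred: 1.0 if that LLM won the human comparison, else 0.0."""
    optimizer.zero_grad()
    q = F.normalize(projected_queries, dim=-1)
    t = F.normalize(theta[llm_ids], dim=-1)
    logits = scale * (q * t).sum(dim=-1)   # cosine similarity, scaled into a logit
    loss = bce(logits, preferred)
    loss.backward()
    optimizer.step()
    return loss.item()
```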
The result: initial LLM embeddings \(\theta_i^{\text{pref}}\) that encode the general task strengths of each model.
Step 2: Online Learning with Contextual Bandits
Once pretrained, PILOT evolves with live user feedback.
Figure 2: The online bandit router adapts by combining query context, an LLM pool, and budget constraints, learning from feedback.
Setup:
- Context: The projected embedding \(\psi(q_t)\) of the current query.
- Arms: The available LLMs.
- Reward: Quality score \(r_t \in [0, 1]\) of the chosen LLM’s response.
The expected reward is modeled as cosine similarity between normalized query and LLM embeddings:
\[ \mathbb{E}[r_t \mid a, q_t] = \cos(\hat{\psi}(q_t), \hat{\theta}_a) = \hat{\psi}(q_t) \cdot \hat{\theta}_a \]
This linear form fits perfectly with LinUCB, which balances:
- Exploitation — pick the arm with the best current estimate.
- Exploration — try arms with high uncertainty.
The selection rule is:
\[ \operatorname*{arg\,max}_{a} \Big[ \cos(\hat{\psi}(q_t), \tilde{\theta}_a^t) + \alpha \sqrt{ \hat{\psi}(q_t)^\top (A_a^t)^{-1} \hat{\psi}(q_t) } \Big] \]
PILOT's edge: its initial parameters \(\theta_a^0\) are set to \(\theta_a^{\text{pref}}\) from pretraining, not zeros, resulting in faster convergence and lower regret.
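A compact sketch of this selection rule with warm starting; the class name, dimensions, and \(\alpha\) are illustrative, and this is my reading of the mechanism rather than the paper's code.

```python
import numpy as np

class WarmStartLinUCB:
    def __init__(self, theta_pref: dict, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = {a: np.eye(dim) for a in theta_pref}   # A_a: per-arm design matrix
        # Warm start: with A_a = I, choosing b_a = A_a @ theta_pref makes the
        # initial estimate equal theta_pref instead of vanilla LinUCB's zeros.
        self.b = {a: self.A[a] @ v for a, v in theta_pref.items()}

    def select(self, psi_q: np.ndarray) -> str:
        psi_q = psi_q / np.linalg.norm(psi_q)
        def ucb(a):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            theta = theta / np.linalg.norm(theta)       # normalized estimate
            return psi_q @ theta + self.alpha * np.sqrt(psi_q @ A_inv @ psi_q)
        return max(self.A, key=ucb)

    def update(self, arm: str, psi_q: np.ndarray, reward: float):
        psi_q = psi_q / np.linalg.norm(psi_q)
        self.A[arm] += np.outer(psi_q, psi_q)
        self.b[arm] += reward * psi_q
```

Since each \(A_a\) starts at the identity, the initial estimate \(A_a^{-1} b_a\) is exactly \(\theta_a^{\text{pref}}\), so the very first routing decisions already reflect the pretrained preferences.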
Step 3: Budget-Aware Routing
Real-world systems must respect budget constraints.
The authors propose an Online Multi-Choice Knapsack Policy (ON-MCKP):
- Budget \(B\) over \(Q\) queries.
- Each LLM choice incurs a cost (its token usage) and yields an estimated reward.
- At each step, only LLMs under a dynamic cost threshold are eligible.
- Pick the eligible LLM with highest expected reward.
A binning strategy divides \(Q\) queries into bins with per-bin budgets. Unused budget spills over, enabling flexible allocation without overspending.
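A hedged sketch of that binning idea: split the query stream into bins, give each bin an equal budget slice, and roll unspent budget forward. The simple affordability check below is a stand-in for the paper's dynamic cost threshold.

```python
def route_with_budget(queries, total_budget, num_bins, costs, expected_reward):
    """costs: dict llm -> per-query cost; expected_reward(llm, q) -> float,
    e.g., the bandit's current estimate for that arm."""
    per_bin = total_budget / num_bins
    bin_size = max(1, len(queries) // num_bins)
    remaining = 0.0
    choices = []
    for i, q in enumerate(queries):
        if i % bin_size == 0:
            remaining += per_bin                 # unused budget spills over
        eligible = [m for m, c in costs.items() if c <= remaining]
        if not eligible:                         # fall back to the cheapest model
            eligible = [min(costs, key=costs.get)]
        chosen = max(eligible, key=lambda m: expected_reward(m, q))
        remaining -= costs[chosen]
        choices.append(chosen)
    return choices
```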
Experimental Setup and Results
Using RouterBench (64 tasks spanning reasoning, math, coding, and conversation), the authors simulate online learning with two buckets:
- Learning Bucket — bandit routing updates occur here.
- Deployment Bucket — performance is evaluated here.
Figure 3: Across both single-task (MMLU) and multi-task (Routerbench) settings, PILOT (orange) outperforms baselines in deployment performance (i), learning efficiency (ii), and regret (iii).
Key highlight:
In multi-task routing, PILOT achieves 93% of GPT-4’s performance at only 25% of its cost.
Routing behavior:
- MMLU (complex reasoning): ~90% of queries to GPT-4.
- GSM8K (math): ~94% to Claude-v1—a cheaper, math-strong model.
- MBPP (coding): Balanced between GPT-4 and Claude.
Analysis and Ablations
Cost Policy Effectiveness
Compared to fixed-budget policies and even hindsight-optimized offline policies, PILOT’s online policy is competitive or superior.
Figure 4: PILOT’s policy (green) matches or exceeds simpler baselines in performance and arm ranking quality.
Table 2: Positive differences show PILOT outperforming an offline policy with perfect hindsight.
Routing Speed
PILOT adds negligible latency: its routing decision is 10–38× faster than GPT-4's inference time.
Table 3: Routing overhead is minimal compared to inference.
Embedding Sensitivity
Swapping OpenAI embeddings for Instructor-XL keeps performance strong.
Figure 5: PILOT remains superior and robust to embedding model changes.
Conclusion
Adaptive LLM Routing under Budget Constraints offers a practical, adaptive, cost-aware routing framework for LLM deployment:
- Data-efficient — learns from single-response feedback.
- Adaptive — continually refines choices for evolving query distributions.
- Cost-effective — near-SOTA performance at a fraction of the cost.
- Fast, robust — minimal latency, embedder-agnostic.
PILOT sets a strong blueprint for intelligent, budget-conscious LLM applications.
Future Directions
- Integrating budget constraints directly into the online learning objective (rather than handling cost in a separate policy).
- Extending routing to multi-turn conversational contexts.
For practitioners deploying LLMs in budget-sensitive environments, PILOT is a step toward systems that are smart enough to spend wisely and adapt quickly—without sacrificing quality.