The world of Large Language Models (LLMs) presents a classic trade-off: performance versus cost. On one hand, you have state-of-the-art models such as GPT-4 that excel at complex reasoning, sophisticated coding, and nuanced conversation. On the other, there are smaller, open-source models that are dramatically cheaper to run but often falter on demanding tasks.

For real-world applications—from customer service chatbots to data analysis tools—the challenge is the same:

How do you achieve the best possible performance without breaking the budget?


Motivation: Smart Model Use in Practice

Imagine a customer service bot. For a simple query such as:

“What are your business hours?”

a small, fast, inexpensive model is perfect. But for a complex request like:

“Compare your two flagship smartphones based on battery life, low-light camera performance, and multitasking capabilities.”

you need the big guns—a large, high-end model with stronger reasoning capacity. Running a heavyweight model on every query is wasteful and expensive, while using a lightweight model for everything risks frustrating your users.

This is the essence of the LLM routing problem: dynamically selecting the most appropriate model for each incoming query.


Limitations of Traditional Solutions

Traditional approaches frame the problem as supervised learning:

  • Build a massive “oracle” dataset by running every query through every available LLM.
  • Label the dataset with the best-performing model for each query.
  • Train a routing model on this full-information dataset.

This approach works in a controlled setting but suffers from two major flaws:

  1. Astronomical Data Collection Costs
    Every training query must be answered by every candidate model to determine the best one, which is slow and expensive in both compute time and inference spend.

  2. Lack of Adaptability
    Real-world queries evolve over time. New topics emerge. Static routers trained on old data fail to adapt.


The New Paradigm: Routing as a Bandit Problem

A recent paper, Adaptive LLM Routing under Budget Constraints, proposes a paradigm shift.

Instead of requiring costly full supervision, the authors treat routing as a contextual bandit problem, drawing inspiration from recommendation systems:

A recommender doesn’t show you every possible item—it picks one, observes your reaction (e.g., a click or not), and learns from that limited feedback.

Similarly, in LLM routing:

  • You choose one model for a query.
  • You get one piece of evaluative feedback (thumbs up/down, rating score).
  • You learn to make better choices over time—without knowing how all other models would have performed (a minimal sketch of this loop follows).
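To make this feedback loop concrete, here is a minimal Python sketch of the interaction pattern. Everything in it (the model pool, the embedding stub, the feedback function, the placeholder router) is a hypothetical stand-in rather than the paper's implementation; a real system would call actual LLM APIs and collect real user ratings.

```python
import random

# Hypothetical model pool and stubbed components: a real deployment would call
# actual LLM APIs and collect real user feedback instead of these placeholders.
MODELS = ["small-model", "mid-model", "large-model"]

def embed(query: str) -> list[float]:
    # Stand-in for a real query embedder (e.g., text-embedding-3-small).
    return [(hash(query) % 1000) / 1000.0]

def call_llm(model: str, query: str) -> str:
    return f"[{model}] answer to: {query}"

def user_feedback(response: str) -> float:
    # Single evaluative signal per query, e.g., thumbs up (1.0) or down (0.0).
    return random.choice([0.0, 1.0])

class RandomRouter:
    """Placeholder router; a bandit router would replace select/update."""
    def select(self, context: list[float]) -> str:
        return random.choice(MODELS)

    def update(self, context: list[float], arm: str, reward: float) -> None:
        pass  # a bandit router refines its per-arm estimates here

def routing_loop(router, queries):
    for q in queries:
        ctx = embed(q)                     # context for this round
        arm = router.select(ctx)           # choose exactly ONE model
        reward = user_feedback(call_llm(arm, q))
        router.update(ctx, arm, reward)    # learn from partial feedback only

routing_loop(RandomRouter(), ["What are your business hours?"])
```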

Introducing PILOT

The authors propose PILOT (Preference-prior Informed LinUCB fOr Adaptive RouTing), a specialized contextual bandit algorithm to learn intelligent routing under budget constraints.

PILOT rests on three pillars:

  1. A Shared Embedding Space — Represent both queries and LLMs in the same vector space so their cosine similarity reflects how well they match.

  2. Pretraining with Human Preferences — Use public datasets where humans have compared model responses to give the router a warm-start with sensible priors.

  3. Continuous Online Learning with a Budget Policy — Refine the embeddings as live feedback streams in and integrate a budget-aware selection mechanism.


Step 1: Pretraining with Human Preferences

Starting from scratch would be inefficient. PILOT instead warm-starts from human preference datasets such as ChatArena, which contain tuples of the form:

(query, model_A_response, model_B_response, preferred_model)

The pretraining has two stability-focused phases:

Phase 1 — Learn Query Projections
Take existing query embeddings (e.g., from text-embedding-3-small) and learn a linear transformation into the shared space.
A contrastive triplet loss pulls together the embeddings of queries that share a preferred model and pushes apart those that do not.
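As a rough illustration of Phase 1, the PyTorch sketch below trains a single linear projection with a cosine-distance triplet loss so that queries preferring the same LLM land near each other. The dimensions, margin, and optimizer settings are assumptions made for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Phase 1 sketch: learn a linear projection of off-the-shelf query embeddings
# with a triplet loss. Dimensions, margin, and learning rate are illustrative
# assumptions, not the paper's hyperparameters.
EMB_DIM, PROJ_DIM = 1536, 128            # e.g., text-embedding-3-small -> shared space
projection = nn.Linear(EMB_DIM, PROJ_DIM, bias=False)
triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda x, y: 1 - nn.functional.cosine_similarity(x, y),
    margin=0.3,
)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

def phase1_step(anchor, positive, negative):
    """anchor/positive prefer the same LLM; negative prefers a different one."""
    loss = triplet_loss(projection(anchor), projection(positive), projection(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of pre-computed query embeddings (random stand-ins for real ones).
a, p, n = (torch.randn(8, EMB_DIM) for _ in range(3))
phase1_step(a, p, n)
```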

Phase 2 — Learn LLM Embeddings
Freeze the query projection. Then learn an embedding for each LLM in the shared space so that, for a given query, the preferred LLM lies close to the query's embedding.
This is framed as a binary classification problem: given a (query, LLM) pair, predict whether that LLM was the preferred one.
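Continuing the sketch, a Phase 2 step might look like the following: the Phase 1 projection is frozen (the inputs here are already projected queries), and only the per-LLM embeddings are trained so that cosine similarity between a projected query and an LLM embedding predicts whether that LLM was preferred. The logit scaling and optimizer are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Phase 2 sketch: with the query projection frozen, fit one embedding per LLM so
# that cosine similarity with the projected query predicts "preferred or not".
# The logit scale and learning rate are illustrative assumptions.
N_LLMS, PROJ_DIM = 3, 128
llm_embeddings = nn.Parameter(torch.randn(N_LLMS, PROJ_DIM))
bce = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam([llm_embeddings], lr=1e-2)

def phase2_step(projected_queries, llm_ids, labels, scale=5.0):
    """labels[i] = 1.0 if llm_ids[i] was the preferred model for query i, else 0.0."""
    q = nn.functional.normalize(projected_queries, dim=-1)
    m = nn.functional.normalize(llm_embeddings[llm_ids], dim=-1)
    logits = scale * (q * m).sum(dim=-1)      # scaled cosine similarity as the logit
    loss = bce(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: in practice the inputs come from the frozen Phase 1 projection.
queries = torch.randn(8, PROJ_DIM)
ids = torch.randint(0, N_LLMS, (8,))
labels = torch.randint(0, 2, (8,)).float()
phase2_step(queries, ids, labels)
```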


Figure 1: The two-phase pretraining process. (1) Contrastive learning aligns query embeddings based on shared LLM preferences. (2) The LLM embeddings themselves are learned and positioned within this shared space.

The result: initial LLM embeddings \(\theta_i^{\text{pref}}\) that encode the general task strengths of each model.


Step 2: Online Learning with Contextual Bandits

Once pretrained, PILOT evolves with live user feedback.


Figure 2: The online bandit router adapts by combining query context, an LLM pool, and budget constraints, learning from feedback.

Setup:

  • Context: Projected embedding of current query \(\psi(q_t)\).
  • Arms: The available LLMs.
  • Reward: Quality score \(r_t \in [0, 1]\) of the chosen LLM’s response.

The expected reward is modeled as cosine similarity between normalized query and LLM embeddings:

\[ \mathbb{E}[r_t|a, q_t] = \cos(\hat{\psi}(q_t), \hat{\theta}_a) = \hat{\psi}(q_t) \cdot \hat{\theta}_a \]

This linear form fits naturally with LinUCB, a classic linear contextual bandit algorithm that balances:

  • Exploitation — pick the arm with the best current estimate.
  • Exploration — try arms with high uncertainty.

The selection rule is:

\[ \operatorname*{arg\,max}_{a} \big[ \cos(\hat{\psi}(q_t), \tilde{\theta}_a^t) + \alpha \sqrt{ \hat{\psi}(q_t)^\top (A_a^t)^{-1} \hat{\psi}(q_t) } \big] \]

PILOT’s edge: Its initial parameters \(\theta_a^0\) are set to \(\theta_a^{\text{pref}}\) from pretraining—not zeros—resulting in faster convergence and lower regret.
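One way to picture this is the NumPy sketch below: a standard per-arm LinUCB estimator whose parameters are initialized from the pretrained preference embeddings instead of zeros. It is an illustration under simplifying assumptions (plain ridge-regression updates, unit-normalized vectors), not the paper's exact algorithm.

```python
import numpy as np

# Warm-started LinUCB sketch: standard per-arm ridge estimates, but initialized
# from the pretrained preference embeddings theta_pref instead of zeros.
class WarmStartLinUCB:
    def __init__(self, theta_pref, alpha=1.0, lam=1.0):
        n_arms, d = theta_pref.shape
        self.alpha = alpha
        self.theta = {a: theta_pref[a] / np.linalg.norm(theta_pref[a]) for a in range(n_arms)}
        self.A = {a: lam * np.eye(d) for a in range(n_arms)}            # design matrices
        self.b = {a: self.A[a] @ self.theta[a] for a in range(n_arms)}  # consistent warm start

    def select(self, x):
        x = x / np.linalg.norm(x)                                       # normalized query embedding
        def ucb(a):
            bonus = self.alpha * np.sqrt(x @ np.linalg.inv(self.A[a]) @ x)  # exploration term
            return float(x @ self.theta[a]) + bonus                         # estimate + bonus
        return max(self.theta, key=ucb)

    def update(self, arm, x, reward):
        x = x / np.linalg.norm(x)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.theta[arm] = np.linalg.solve(self.A[arm], self.b[arm])     # ridge-regression estimate

# Toy usage: 3 arms (LLMs) in a 4-dimensional shared space.
router = WarmStartLinUCB(theta_pref=np.random.randn(3, 4), alpha=0.5)
q = np.random.randn(4)
arm = router.select(q)
router.update(arm, q, reward=0.8)
```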


Step 3: Budget-Aware Routing

Real-world systems must respect budget constraints.

The authors propose an Online Multi-Choice Knapsack Policy (ON-MCKP):

  • Budget \(B\) over \(Q\) queries.
  • Each LLM choice consumes cost (token usage) and yields an estimated reward.
  • At each step, only LLMs under a dynamic cost threshold are eligible.
  • Pick the eligible LLM with highest expected reward.

A binning strategy divides \(Q\) queries into bins with per-bin budgets. Unused budget spills over, enabling flexible allocation without overspending.
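The sketch below mimics the flavor of this policy: split the query stream into bins, release each bin's share of the budget as it begins (carrying over anything unspent), and at each step pick the highest-scoring model among those whose estimated cost fits under the current per-query threshold. The threshold rule, cost model, and fallback to the cheapest model are illustrative assumptions, not the paper's exact ON-MCKP.

```python
def route_with_budget(queries, llms, total_budget, n_bins, score_fn, cost_fn):
    """Greedy budget-aware routing over a stream of queries (illustrative only)."""
    per_bin = total_budget / n_bins
    bin_size = max(1, -(-len(queries) // n_bins))       # ceiling division
    budget_left = 0.0
    choices = []
    for i, q in enumerate(queries):
        if i % bin_size == 0:
            budget_left += per_bin                       # new bin: release budget, keep leftovers
        remaining_in_bin = bin_size - (i % bin_size)
        threshold = budget_left / remaining_in_bin       # dynamic per-query cost threshold
        eligible = [m for m in llms if cost_fn(m, q) <= threshold]
        if not eligible:                                 # fall back to the cheapest model
            eligible = [min(llms, key=lambda m: cost_fn(m, q))]
        chosen = max(eligible, key=lambda m: score_fn(m, q))  # best expected reward among affordable arms
        budget_left -= cost_fn(chosen, q)
        choices.append(chosen)
    return choices

# Toy usage: two models with flat per-query costs and fixed expected rewards.
COSTS = {"small": 0.2, "large": 1.0}
SCORES = {"small": 0.6, "large": 0.9}
picks = route_with_budget(
    queries=[f"q{i}" for i in range(10)],
    llms=list(COSTS),
    total_budget=5.0,
    n_bins=2,
    score_fn=lambda m, q: SCORES[m],
    cost_fn=lambda m, q: COSTS[m],
)
print(picks)
```

In a full system, score_fn would be the bandit's expected-reward estimate and cost_fn an estimate of each model's token cost for the query.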


Experimental Setup and Results

Using Routerbench (64 tasks spanning reasoning, math, coding, and conversation), the authors simulated online learning with two buckets:

  • Learning Bucket — bandit routing updates occur here.
  • Deployment Bucket — performance is evaluated here.


Figure 3: Across both single-task (MMLU) and multi-task (Routerbench) settings, PILOT (orange) outperforms baselines in deployment performance (i), learning efficiency (ii), and regret (iii).

Key highlight:
In multi-task routing, PILOT achieves 93% of GPT-4’s performance at only 25% of its cost.

Routing behavior:

  • MMLU (complex reasoning): ~90% of queries to GPT-4.
  • GSM8K (math): ~94% to Claude-v1—a cheaper, math-strong model.
  • MBPP (coding): Balanced between GPT-4 and Claude.

Analysis and Ablations

Cost Policy Effectiveness
Compared to fixed-budget policies and even hindsight-optimized offline policies, PILOT’s online policy is competitive or superior.


Figure 4: PILOT’s policy (green) matches or exceeds simpler baselines in performance and arm ranking quality.


Table 2: Positive differences show PILOT outperforming an offline policy with perfect hindsight.

Routing Speed
PILOT adds negligible latency: selecting a model is 10–38× faster than a single GPT-4 inference.


Table 3: Routing overhead is minimal compared to inference.

Embedding Sensitivity
Swapping OpenAI embeddings for Instructor-XL keeps performance strong.


Figure 5: PILOT remains superior and robust to embedding model changes.


Conclusion

Adaptive LLM Routing under Budget Constraints offers a practical, adaptive, cost-aware routing framework for LLM deployment:

  • Data-efficient — learns from single-response feedback.
  • Adaptive — continually refines choices for evolving query distributions.
  • Cost-effective — near-SOTA performance at a fraction of the cost.
  • Fast and robust — minimal routing latency and insensitivity to the choice of embedding model.

PILOT sets a strong blueprint for intelligent, budget-conscious LLM applications.

Future Directions

  • Integrating budget constraints directly into the online learning objective (rather than handling cost through a separate policy).
  • Extending routing to multi-turn conversational contexts.

For practitioners deploying LLMs in budget-sensitive environments, PILOT is a step toward systems that are smart enough to spend wisely and adapt quickly—without sacrificing quality.