If your AI application routes every request through a frontier model, your inference bill is higher than it needs to be — probably by a lot. Building an LLM model router in Python is one of the highest-leverage optimizations you can make as an ML engineer: it matches query complexity to model capability automatically, without sacrificing quality on the requests that actually need it.
The scale of the problem is real. LLM API spending hit $8.4 billion by mid-2025 and doubled year-over-year (Binadox, 2025). Yet 60–80% of production requests are routine tasks — summarization, classification, extraction — that a much cheaper model handles with near-identical accuracy.
This tutorial gives you working Python code for a three-tier cascade router using only httpx and provider REST APIs. No LangChain, no AWS Bedrock, no framework lock-in. Clean patterns you can drop into any stack.
Why You’re Overpaying for LLM Calls (And How Bad It Gets at Scale)
Here’s a concrete scenario: your application processes 100,000 requests per day, all routed through GPT-4o. Even at modest response lengths, you’re spending thousands per month — and that number compounds fast as usage grows.
What makes this painful is that most of those requests don’t need GPT-4o. A model like Claude Haiku or GPT-4o-mini handles date extraction, ticket classification, and paragraph summarization with near-identical accuracy at a fraction of the cost.
“Since 60–80% of production requests are routine tasks, intelligent routing can reduce API spend by 40–85%. One real-world team processing 100K daily requests cut their bill from $4,500/month to $1,500/month.” — Morph LLM / FutureAGI, 2025
The math scales brutally. A typical enterprise routing 70% of queries to a budget model, 20% to mid-tier, and 10% to a premium model can reduce per-query cost by 60–80% with minimal quality impact (Requesty Blog, 2025). And AI cloud inference spending is projected to hit $20.6 billion in 2026 — surpassing training costs for the first time (AnalyticsWeek / Market Clarity, 2026).
If you’re building at any real scale, routing isn’t optional. It’s table stakes.
Routing vs. Cascading vs. Fallback — Picking the Right Pattern
These three terms get conflated constantly. Using the wrong pattern costs you either money or quality, so it’s worth being precise.
Routing means classifying the query upfront and sending it to exactly one model. It’s fast and deterministic. Best when you have a reliable classifier and your query categories are well-defined.
Cascading means starting with the cheapest model and escalating only when the response doesn’t meet your quality threshold. It’s self-correcting — you don’t need a perfect upfront classifier. Stanford’s FrugalGPT demonstrated this approach can match GPT-4 accuracy with up to 98% cost reduction using cascade strategies.
Fallback is the simplest form: try model A, and if it errors or times out, use model B. This is error handling, not cost optimization. Most teams already do this.
The best production implementations combine all three: classify upfront when confidence is high, cascade when it isn’t, and fall back on infrastructure failures. Research from ETH Zürich found that a unified routing + cascading approach outperforms either strategy in isolation by up to 14% on cost-quality tradeoffs (ICLR 2025).
For most teams starting out: cascade first. It’s forgiving, requires no training data, and the overhead of one extra cheap-model call is negligible compared to the savings.
Classifying Query Complexity Without a Separate ML Pipeline
The biggest objection to routing is that you need a classifier, and classifiers need training data. That’s true for production-grade routing at scale. But you can capture 80% of the benefit with zero ML infrastructure using heuristics.
Here’s a lightweight classifier that uses pattern matching and query length as complexity signals:
```python
# complexity_classifier.py
import re
from enum import Enum

class Complexity(str, Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

SIMPLE_PATTERNS = [
    r'\b(summarize|classify|translate|extract|list|format|convert)\b',
    r'\b(what is|define|who is|when did|spell)\b',
]

COMPLEX_PATTERNS = [
    r'\b(analyze|compare|evaluate|critique|synthesize|debate)\b',
    r'\b(implement|write a function|refactor|design|architect)\b',
    r'\b(why does|how would|what are the tradeoffs)\b',
]

def classify_complexity(query: str) -> Complexity:
    """Classify a query as simple, medium, or complex using heuristics."""
    q = query.lower().strip()
    word_count = len(q.split())
    complex_hits = sum(1 for p in COMPLEX_PATTERNS if re.search(p, q))
    simple_hits = sum(1 for p in SIMPLE_PATTERNS if re.search(p, q))
    if complex_hits >= 1 or word_count > 120:
        return Complexity.COMPLEX
    elif simple_hits >= 1 and word_count < 50:
        return Complexity.SIMPLE
    else:
        return Complexity.MEDIUM
```
This runs in microseconds and adds zero external dependencies. For production, you’ll eventually augment this with a fine-tuned binary classifier or embedding-based similarity search — but this is plenty to validate the routing approach before investing in ML infrastructure.
Building an LLM Model Router in Python: Full Code
The router below uses three tiers:
- Tier 0 (cheap): Claude Haiku, GPT-4o-mini, Groq Llama — fast, cheap, excellent for simple tasks
- Tier 1 (mid): Claude Sonnet, GPT-4o — balanced cost and capability
- Tier 2 (premium): Claude Opus — maximum capability for hard problems
The implementation speaks directly to OpenAI-compatible REST endpoints using httpx. Swap in any provider that exposes one (Groq, local Ollama, hosted open-source models, or Anthropic via its OpenAI compatibility layer) by changing the api_url and api_key.
```python
# cascade_router.py
import httpx
import time
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ModelConfig:
    name: str
    api_url: str
    api_key: str
    cost_per_1k_input_tokens: float
    cost_per_1k_output_tokens: float
    timeout_seconds: float = 15.0

@dataclass
class RouteResult:
    response: str
    model_used: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    escalations: int
    latency_ms: float

class CascadeRouter:
    def __init__(self, tiers: list[ModelConfig]):
        """tiers: ordered list of ModelConfig, cheapest first."""
        self.tiers = tiers

    def _call_model(self, model: ModelConfig, messages: list[dict]) -> dict:
        """Make a synchronous REST call to an OpenAI-compatible endpoint."""
        headers = {
            "Authorization": f"Bearer {model.api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model.name,
            "messages": messages,
            "max_tokens": 1024,
        }
        response = httpx.post(
            model.api_url,
            json=payload,
            headers=headers,
            timeout=model.timeout_seconds,
        )
        response.raise_for_status()
        return response.json()

    def _compute_cost(self, model: ModelConfig, usage: dict) -> float:
        input_cost = (usage["prompt_tokens"] / 1000) * model.cost_per_1k_input_tokens
        output_cost = (usage["completion_tokens"] / 1000) * model.cost_per_1k_output_tokens
        return round(input_cost + output_cost, 6)

    def route(
        self,
        query: str,
        start_tier: int = 0,
        max_cost_usd: float = 0.10,
    ) -> RouteResult:
        """
        Attempt each tier in order, escalating on failure.
        start_tier allows upfront routing to skip directly to tier 1 or 2.
        max_cost_usd is a soft budget: breaches are logged for monitoring
        rather than rejected, since the spend has already happened.
        """
        messages = [{"role": "user", "content": query}]
        escalations = 0
        t_start = time.monotonic()
        for i, model in enumerate(self.tiers[start_tier:], start=start_tier):
            try:
                data = self._call_model(model, messages)
                usage = data["usage"]
                cost = self._compute_cost(model, usage)
                if cost > max_cost_usd:
                    logger.warning(
                        f"Cost ${cost} exceeded budget ${max_cost_usd} on {model.name}."
                    )
                latency = (time.monotonic() - t_start) * 1000
                return RouteResult(
                    response=data["choices"][0]["message"]["content"],
                    model_used=model.name,
                    input_tokens=usage["prompt_tokens"],
                    output_tokens=usage["completion_tokens"],
                    cost_usd=cost,
                    escalations=escalations,
                    latency_ms=round(latency, 1),
                )
            except (httpx.TimeoutException, httpx.HTTPStatusError) as e:
                logger.warning(f"Model {model.name} failed: {e}. Escalating.")
                escalations += 1
                if i == len(self.tiers) - 1:
                    raise RuntimeError("All models in the cascade failed.") from e
        raise RuntimeError("Cascade exhausted without a result.")
```
Wire it up with your complexity classifier:
```python
from cascade_router import CascadeRouter, ModelConfig, RouteResult
from complexity_classifier import classify_complexity, Complexity

TIERS = [
    ModelConfig(
        name="gpt-4o-mini",
        api_url="https://api.openai.com/v1/chat/completions",
        api_key="YOUR_OPENAI_KEY",
        cost_per_1k_input_tokens=0.00015,
        cost_per_1k_output_tokens=0.0006,
    ),
    ModelConfig(
        name="gpt-4o",
        api_url="https://api.openai.com/v1/chat/completions",
        api_key="YOUR_OPENAI_KEY",
        cost_per_1k_input_tokens=0.0025,
        cost_per_1k_output_tokens=0.01,
    ),
    ModelConfig(
        name="claude-opus-4",
        # Anthropic's native Messages API is not OpenAI-compatible; this
        # points at Anthropic's OpenAI compatibility endpoint instead.
        api_url="https://api.anthropic.com/v1/chat/completions",
        api_key="YOUR_ANTHROPIC_KEY",
        cost_per_1k_input_tokens=0.015,
        cost_per_1k_output_tokens=0.075,
    ),
]

router = CascadeRouter(tiers=TIERS)

def handle_query(query: str) -> RouteResult:
    complexity = classify_complexity(query)
    start_tier = {
        Complexity.SIMPLE: 0,
        Complexity.MEDIUM: 1,
        Complexity.COMPLEX: 2,
    }[complexity]
    return router.route(query, start_tier=start_tier)
```
Swapping providers is as simple as changing api_url and api_key, as long as the endpoint speaks the OpenAI chat-completions format. The rest of the logic is provider-agnostic.
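For example, a Groq tier slots in with only those two fields changed. In the fragment below, the model id and per-token rates are illustrative assumptions; check Groq's current model list and pricing before using them:

```python
groq_tier = ModelConfig(
    name="llama-3.1-8b-instant",  # illustrative model id; verify against Groq's catalog
    api_url="https://api.groq.com/openai/v1/chat/completions",  # Groq's OpenAI-compatible endpoint
    api_key="YOUR_GROQ_KEY",
    cost_per_1k_input_tokens=0.00005,   # illustrative; check current pricing
    cost_per_1k_output_tokens=0.00008,  # illustrative; check current pricing
)
```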
Implementing Confidence-Based Escalation With Self-Consistency Checks
Heuristic classification handles clear cases well. But what about ambiguous queries where you’re not sure if the cheap model will do a good enough job?
Self-consistency is a practical confidence proxy: run the same query through the cheap model twice (with temperature > 0) and compare the responses. High semantic similarity means the model is giving a stable, confident answer. Low similarity means it’s uncertain — escalate.
```python
import logging
import time
from difflib import SequenceMatcher

from cascade_router import CascadeRouter, RouteResult

logger = logging.getLogger(__name__)

def semantic_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio as a quick confidence proxy."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route_with_consistency_check(
    router: CascadeRouter,
    query: str,
    consistency_threshold: float = 0.75,
) -> RouteResult:
    """
    Sample tier 0 twice. If responses diverge, escalate to tier 1.
    Assumes the tier-0 endpoint samples with temperature > 0 (the default
    for most providers), so identical prompts can produce different text.
    """
    t_start = time.monotonic()
    messages = [{"role": "user", "content": query}]
    tier0 = router.tiers[0]
    data_a = router._call_model(tier0, messages)
    data_b = router._call_model(tier0, messages)
    sample_a = data_a["choices"][0]["message"]["content"]
    sample_b = data_b["choices"][0]["message"]["content"]
    similarity = semantic_similarity(sample_a, sample_b)
    if similarity >= consistency_threshold:
        # High confidence: return the first sample directly instead of
        # paying for a third tier-0 call.
        usage = data_a["usage"]
        return RouteResult(
            response=sample_a,
            model_used=tier0.name,
            input_tokens=usage["prompt_tokens"],
            output_tokens=usage["completion_tokens"],
            cost_usd=router._compute_cost(tier0, usage),
            escalations=0,
            latency_ms=round((time.monotonic() - t_start) * 1000, 1),
        )
    # Low confidence: skip to tier 1
    logger.info(f"Low consistency ({similarity:.2f}) — escalating to tier 1.")
    return router.route(query, start_tier=1)
```
Tip for OpenAI users: the logprobs parameter returns token-level log probabilities. Averaging these gives a more precise confidence score than string similarity; a low average log prob signals high uncertainty. This is a drop-in replacement for the similarity check and requires only one API call instead of two.
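A minimal sketch of that idea, assuming the OpenAI-style response shape produced when the request payload includes "logprobs": true. The -0.4 cutoff and the helper names are illustrative assumptions, not tuned or standard values:

```python
def extract_token_logprobs(data: dict) -> list[float]:
    """Pull per-token log probs from an OpenAI-style chat completion
    that was requested with {"logprobs": true} in the payload."""
    return [tok["logprob"] for tok in data["choices"][0]["logprobs"]["content"]]

def is_confident(token_logprobs: list[float], threshold: float = -0.4) -> bool:
    """Treat an average token log prob above the threshold as confident.
    -0.4 is an illustrative starting point; calibrate on your eval set."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg >= threshold
```

When is_confident returns False, escalate exactly as the two-sample version does, with router.route(query, start_tier=1).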
The two-sample approach adds latency (two Tier 0 calls vs. one), so apply it selectively — for example, only when your upfront classifier returns MEDIUM complexity.
Tuning Cost Thresholds and Quality Floors at Runtime
Hardcoding thresholds is a trap. Query volumes, quality requirements, and cost budgets shift constantly in production. Your router’s behavior shouldn’t require a code deploy to adjust.
Store routing policy in a simple config object that reloads from disk:
```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RouterPolicy:
    max_cost_per_query_usd: float = 0.05
    consistency_threshold: float = 0.75
    complex_min_tier: int = 1
    enable_cascade: bool = True
    simple_max_tokens: int = 512

def load_policy(path: str = "router_policy.json") -> RouterPolicy:
    try:
        with open(path) as f:
            data = json.load(f)
        return RouterPolicy(**data)
    except FileNotFoundError:
        return RouterPolicy()  # sensible defaults

def save_policy(policy: RouterPolicy, path: str = "router_policy.json"):
    with open(path, "w") as f:
        json.dump(asdict(policy), f, indent=2)
```
With this in place, tightening budgets under traffic spikes, lowering quality thresholds during an incident, or disabling cascading for a debugging session all become one-line config changes — no application restart required.
Production Hardening: Circuit Breakers, Timeouts, and Cost Logging
A router that works in development will fail in subtle ways in production without these three additions.
Circuit breakers
If a model is consistently erroring, stop sending to it immediately instead of waiting for per-request timeouts to expire:
```python
import logging
from collections import deque
from datetime import datetime, timedelta
from typing import Optional

logger = logging.getLogger(__name__)

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_seconds: int = 60):
        self.failures: deque = deque()
        self.threshold = failure_threshold
        self.recovery = timedelta(seconds=recovery_seconds)
        self.open_until: Optional[datetime] = None

    def record_failure(self):
        now = datetime.utcnow()
        self.failures.append(now)
        # Prune failures outside the recovery window
        while self.failures and now - self.failures[0] > self.recovery:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.recovery
            logger.warning("Circuit breaker opened — skipping this model.")

    def is_open(self) -> bool:
        if self.open_until and datetime.utcnow() < self.open_until:
            return True
        self.open_until = None
        return False
```
Attach one CircuitBreaker instance per ModelConfig and check is_open() at the top of each cascade iteration. If open, increment escalations and continue to the next tier.
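A sketch of that wiring. It assumes a breakers dict keyed by model name, which is an addition to CascadeRouter, and returns the raw response dict rather than a full RouteResult to stay brief:

```python
def route_with_breakers(router, breakers, query, start_tier=0):
    """Cascade loop with per-model circuit breakers. Returns the raw
    response dict, the model name used, and the escalation count."""
    messages = [{"role": "user", "content": query}]
    escalations = 0
    for model in router.tiers[start_tier:]:
        breaker = breakers[model.name]
        if breaker.is_open():
            escalations += 1  # skip a known-bad model without waiting on a timeout
            continue
        try:
            data = router._call_model(model, messages)
            return data, model.name, escalations
        except Exception:  # in the real router: httpx timeout/status errors
            breaker.record_failure()
            escalations += 1
    raise RuntimeError("All models failed or had open breakers.")
```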
Per-request cost logging
Every RouteResult already carries cost, token counts, and latency. Log it as structured JSON:
```python
import json
import logging
import time

from cascade_router import RouteResult

logger = logging.getLogger(__name__)

def log_route_result(query_id: str, result: RouteResult):
    record = {
        "query_id": query_id,
        "model": result.model_used,
        "input_tokens": result.input_tokens,
        "output_tokens": result.output_tokens,
        "cost_usd": result.cost_usd,
        "escalations": result.escalations,
        "latency_ms": result.latency_ms,
        "ts": time.time(),
    }
    logger.info(json.dumps(record))
```
Pipe this into Datadog, Grafana Loki, or CloudWatch, and you have real-time cost-per-query visibility with zero additional tooling.
Per-model timeout budgets
Set timeout_seconds tightly for cheap models (5–8s) and generously for premium ones (25–30s). When httpx.TimeoutException fires, the cascade loop handles escalation automatically — no extra error handling needed.
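The payoff of tight lower-tier budgets is a bounded worst case: before the premium tier even starts, you wait at most the sum of the lower tiers' timeouts. A worked example with illustrative budgets of 6 s, 15 s, and 30 s (the mid-tier value is an assumption; tune all three to your latency SLOs):

```python
# Assumed per-tier timeout budgets in seconds, cheapest first.
tier_timeouts = [6.0, 15.0, 30.0]

# Worst-case wait before the premium tier begins: every lower
# tier times out fully before the cascade escalates.
worst_case_before_premium = sum(tier_timeouts[:-1])  # 21.0 s

# Worst-case end-to-end latency: every tier uses its full budget.
worst_case_total = sum(tier_timeouts)  # 51.0 s
```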
Benchmarking Your Router Before You Ship It
Never deploy a router without measuring it. You need two numbers: average cost-per-query and quality delta versus routing everything to your best model.
Build an evaluation set of 100–200 representative queries with known-good reference outputs. Run them through both the router and your all-premium baseline, then compare:
```python
from rouge_score import rouge_scorer as rs

def evaluate_router(
    test_cases: list[dict],  # [{"query": ..., "reference": ...}]
    router: CascadeRouter,
    policy: RouterPolicy,
) -> dict:
    scorer = rs.RougeScorer(["rougeL"], use_stemmer=True)
    total_cost = 0.0
    rouge_scores = []
    for case in test_cases:
        result = router.route(
            case["query"],
            max_cost_usd=policy.max_cost_per_query_usd,
        )
        total_cost += result.cost_usd
        score = scorer.score(case["reference"], result.response)
        rouge_scores.append(score["rougeL"].fmeasure)
    n = len(test_cases)
    return {
        "avg_quality_rougeL": round(sum(rouge_scores) / n, 4),
        "avg_cost_per_query_usd": round(total_cost / n, 6),
        "total_cost_usd": round(total_cost, 4),
        "queries_evaluated": n,
    }
```
Compare avg_quality_rougeL against your all-Tier-2 baseline. If quality drops by more than your acceptable threshold — typically 5–10% relative — tighten consistency_threshold or raise complex_min_tier in your policy config.
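Computing that delta from two evaluate_router runs takes a few lines. The compare_to_baseline helper below is an illustrative addition; the baseline itself is just a one-tier cascade built from your premium ModelConfig:

```python
def compare_to_baseline(routed: dict, baseline: dict) -> dict:
    """Relative quality drop and cost savings of the router versus an
    all-premium baseline, given two evaluate_router result dicts."""
    return {
        "relative_quality_drop": round(
            1 - routed["avg_quality_rougeL"] / baseline["avg_quality_rougeL"], 4
        ),
        "cost_savings": round(
            1 - routed["avg_cost_per_query_usd"] / baseline["avg_cost_per_query_usd"], 4
        ),
    }

# Usage sketch (makes network calls, so shown as comments):
#   baseline_router = CascadeRouter(tiers=[TIERS[-1]])  # premium tier only
#   delta = compare_to_baseline(
#       evaluate_router(cases, router, policy),
#       evaluate_router(cases, baseline_router, policy),
#   )
```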
The research validates what you should expect. RouteLLM’s router achieved 95% of GPT-4 performance using only 26% GPT-4 calls — a roughly 3.66x cost reduction (ICLR 2025). A well-calibrated cascade should land you in that range once your evaluation loop is dialed in.
Run this benchmark on real production traffic samples, not synthetic queries. The distribution of query complexity in your actual workload is what determines your savings — and it’s almost always more favorable than you expect.
Start Saving Before You Optimize Further
If you route every LLM request to your most expensive model, you’re not just overpaying — you’re leaving a 60–80% cost reduction on the table for the majority of your traffic, with no meaningful quality tradeoff.
The LLM model router in Python above gives you everything you need to change that: heuristic complexity classification, self-consistency checks for confidence-based escalation, circuit breakers for reliability, and structured cost logging for observability. It’s under 200 lines of Python and speaks directly to provider REST APIs — swap in any combination of OpenAI, Anthropic, Groq, or local models without touching the routing logic.
Start by profiling your actual traffic. Log the complexity distribution of real queries for a week, identify which tasks consistently succeed on your cheapest tier, and deploy the router with conservative thresholds. Then iterate based on your benchmarks. Your inference bill will reflect the difference within the first billing cycle.
Have questions about adapting this to your specific provider mix or query types? Drop them in the comments — the patterns here generalize further than they might first appear.