If you’re sending every LLM request to GPT-4 or Claude Opus, you’re almost certainly overpaying by 3–10×. The premium models are exceptional, but they cost $30–60 per million tokens while capable budget models cost $0.10–0.50 per million. That’s a 60–300× price gap for calls that often don’t need top-tier reasoning.
Building an LLM router in Python is how you stop paying premium prices for budget-level tasks. In this tutorial, you’ll build a production-ready router from scratch — no LangChain, no LiteLLM, no framework lock-in — that scores prompt complexity, selects models by cost and latency, fails over automatically with a circuit breaker, and logs every decision so you can tune it over time.
Why Your Single-Model Setup Is Costing You 3–10× More Than It Should
The cost differential between frontier and budget models isn’t marginal — it’s transformational. As of early 2026, GPT-4-class and Claude Opus models cost $30–60 per million tokens; small open models like Llama 3 or Mistral cost $0.10–0.50 per million tokens — a 60–300× spread ([CloudIDR, 2026](https://www.cloudidr.com/blog/llm-pricing-comparison-2026)).
At 100,000 requests per day, each averaging 500 tokens, that’s 50 million tokens/day. At GPT-4 pricing ($40/M), that’s $2,000/day. Route 70% of those requests to a $0.20/M budget model and the daily bill drops to roughly $607: $600 for the 30% still on GPT-4, plus about $7 for the 35M budget-tier tokens, without touching output quality for routine tasks.
The compounding gets worse in agent loops. If your agent makes 8 sequential LLM calls per user request and you’re using a premium model for all of them, every dollar of cost multiplies by 8. A router that sends only the final synthesis step to GPT-4 and intermediate steps to a budget model reclaims that multiplier immediately.
A 2025 analysis of 86,000 developers found that 40–60% of LLM API budgets go to operational inefficiencies — over-routing to premium models, no caching, no fallback logic ([PremAI, 2026](https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/)). The fix isn’t more budget — it’s smarter routing.
The Anatomy of a Production LLM Router in Python
Before writing any code, understand the three layers every production router needs:
- Complexity Scorer — evaluates each prompt and assigns a complexity tier (low / medium / high). This is the decision engine.
- Model Registry — a configuration of available models with their per-token costs, tier, and health status.
- Dispatch Engine — selects the right model using the complexity score, live latency data, cost constraints, and circuit breaker state. Executes the call and logs the outcome.
These three layers are loosely coupled by design. You can swap out the scorer, add a model to the registry, or change routing thresholds independently — without touching the other components.
Building a Complexity Scorer in Pure Python
You don’t need an ML model to score prompt complexity. A combination of keyword detection, prompt length, and structural signals covers the majority of routing decisions correctly — and it runs in microseconds.
```python
import re
from dataclasses import dataclass

COMPLEX_KEYWORDS = {
    "analyze", "compare", "explain why", "pros and cons", "tradeoffs",
    "step by step", "debug", "refactor", "write a function", "evaluate",
    "summarize", "synthesize", "argue", "critique",
}

SIMPLE_KEYWORDS = {
    "translate", "spell check", "convert", "list", "what is",
    "define", "extract", "yes or no", "true or false",
}

@dataclass
class ComplexityScore:
    tier: str            # "low", "medium", "high"
    score: float         # 0.0 to 1.0
    signals: list[str]

def score_complexity(prompt: str) -> ComplexityScore:
    signals = []
    score = 0.0
    lower = prompt.lower()

    # Signal 1: Keyword detection
    if any(kw in lower for kw in COMPLEX_KEYWORDS):
        score += 0.4
        signals.append("complex_keyword")
    if any(kw in lower for kw in SIMPLE_KEYWORDS):
        score -= 0.2
        signals.append("simple_keyword")

    # Signal 2: Prompt length
    token_estimate = len(prompt.split())
    if token_estimate > 500:
        score += 0.3
        signals.append("long_prompt")
    elif token_estimate < 50:
        score -= 0.1
        signals.append("short_prompt")

    # Signal 3: Structural complexity
    if "```" in prompt or re.search(r"\bdef \w+\b|\bclass \w+\b", prompt):
        score += 0.2
        signals.append("code_block")
    if prompt.count("?") > 2:
        score += 0.1
        signals.append("multi_question")

    score = max(0.0, min(1.0, score))
    tier = "high" if score >= 0.6 else "medium" if score >= 0.3 else "low"
    return ComplexityScore(tier=tier, score=score, signals=signals)
```
This scorer is intentionally transparent. Every decision is captured in `signals`, which feed directly into your routing logs. You can tune the thresholds after deployment once you see real traffic data — and you’ll have the signal names to guide you.
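One caveat on Signal 2: `len(prompt.split())` counts words, not tokens, and English prose runs closer to ~1.3 tokens per word. If you want a slightly better estimate without pulling in a tokenizer dependency, a common rule of thumb is about four characters per token. A minimal sketch (the `estimate_tokens` name is ours, not part of the scorer above):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    This is an approximation for English text; use a real tokenizer
    (e.g. tiktoken) if billing-accurate counts matter.
    """
    return max(1, len(text) // 4)

# A 400-character prompt estimates to roughly 100 tokens
print(estimate_tokens("x" * 400))  # → 100
```

Swap it in for `prompt.split()` if you'd rather tune the length thresholds in tokens instead of words.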
Defining Your Model Registry with Cost-Per-Token Pricing
The model registry is a plain Python dataclass that centralizes your model configuration. Define it once; reference it everywhere.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    name: str
    provider: str
    tier: str                  # "budget", "mid", "premium"
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens
    max_tokens: int
    timeout_s: float = 30.0

MODEL_REGISTRY = {
    "budget": ModelConfig(
        name="llama-3-8b",
        provider="groq",
        tier="budget",
        input_cost_per_1k=0.00005,
        output_cost_per_1k=0.00008,
        max_tokens=8192,
    ),
    "mid": ModelConfig(
        name="claude-haiku-3-5",
        provider="anthropic",
        tier="mid",
        input_cost_per_1k=0.00080,
        output_cost_per_1k=0.00400,
        max_tokens=200000,
    ),
    "premium": ModelConfig(
        name="claude-opus-4",
        provider="anthropic",
        tier="premium",
        input_cost_per_1k=0.01500,
        output_cost_per_1k=0.07500,
        max_tokens=200000,
    ),
}

COMPLEXITY_TO_TIER = {"low": "budget", "medium": "mid", "high": "premium"}
```
Two rules to follow here. First, keep `input_cost_per_1k` and `output_cost_per_1k` separate — output tokens are typically 3–5× more expensive than input tokens, and conflating them makes your cost estimates wrong. Second, version-pin your model names. When a provider releases a new model, you decide when to adopt it.
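Keeping the two prices separate pays off immediately: per-request cost estimation becomes a two-line function. Here is a sketch against a trimmed-down `ModelConfig` (the `estimate_cost` helper is illustrative, not part of the registry code above):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens

def estimate_cost(model: ModelConfig, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call, pricing input and output separately."""
    return ((input_tokens / 1000) * model.input_cost_per_1k
            + (output_tokens / 1000) * model.output_cost_per_1k)

premium = ModelConfig("claude-opus-4", input_cost_per_1k=0.015, output_cost_per_1k=0.075)
print(f"${estimate_cost(premium, 300, 200):.4f}")  # → $0.0195
```

Conflating the two rates into one blended number would misprice output-heavy calls (summarization, generation) by several multiples.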
Implementing Cost-Based Routing and Latency Tracking
With a complexity score and a model registry, the dispatch engine can now select a model. Complexity alone isn’t enough, though — you also need real-time latency tracking to avoid sending time-sensitive requests to a slow provider.
```python
import time
from collections import deque

class LatencyTracker:
    def __init__(self, window: int = 50):
        self._samples: dict[str, deque] = {}
        self.window = window

    def record(self, model_name: str, latency_s: float):
        if model_name not in self._samples:
            self._samples[model_name] = deque(maxlen=self.window)
        self._samples[model_name].append(latency_s)

    def p95(self, model_name: str) -> Optional[float]:
        samples = self._samples.get(model_name)
        if not samples or len(samples) < 5:
            return None
        sorted_s = sorted(samples)
        return sorted_s[int(len(sorted_s) * 0.95)]

latency_tracker = LatencyTracker()

def select_model(
    complexity: ComplexityScore,
    max_latency_s: Optional[float] = None,
) -> ModelConfig:
    preferred_tier = COMPLEXITY_TO_TIER[complexity.tier]
    tier_order = ["budget", "mid", "premium"]
    for tier in tier_order[tier_order.index(preferred_tier):]:
        model = MODEL_REGISTRY[tier]
        if max_latency_s is not None:
            p95 = latency_tracker.p95(model.name)
            if p95 is not None and p95 > max_latency_s:
                continue  # Too slow right now -- step up a tier
        return model
    return MODEL_REGISTRY["premium"]  # Final fallback
```
The `max_latency_s` parameter is the key addition. If a budget provider’s P95 latency is 8 seconds when your SLA requires 3, the router automatically steps up to the next tier rather than failing the request. A routing system with automatic latency-aware failover has demonstrated a 5× improvement in SLO attainment and 31.6% latency reduction versus single-provider deployments ([Machine Learning Plus, 2025](https://machinelearningplus.com/gen-ai/multi-provider-llm-router/)).
Adding Production-Grade Failover with a Circuit Breaker
A simple `try/except` retry loop is not a circuit breaker. When a provider is having a sustained outage, retrying every request wastes time and compounds latency — especially painful in agent loops. You need a state machine that learns from failure history.
The circuit breaker has three states:
- Closed — normal operation; requests pass through
- Open — provider is unhealthy; fail fast without attempting the call
- Half-open — provider might be recovering; send one probe request to check
```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._failures = 0
        self._state = CircuitState.CLOSED
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> CircuitState:
        if self._state == CircuitState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed > self.recovery_timeout_s:
                self._state = CircuitState.HALF_OPEN
        return self._state

    def record_success(self):
        self._failures = 0
        self._state = CircuitState.CLOSED

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._state = CircuitState.OPEN
            self._opened_at = time.monotonic()

    def allow_request(self) -> bool:
        return self.state != CircuitState.OPEN

circuit_breakers: dict[str, CircuitBreaker] = {}

def get_breaker(provider: str) -> CircuitBreaker:
    if provider not in circuit_breakers:
        circuit_breakers[provider] = CircuitBreaker()
    return circuit_breakers[provider]
```
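Before wiring the breaker into dispatch, it's worth watching the state machine complete a full cycle. The following is a condensed, self-contained copy of the breaker above, run through an outage and recovery with a deliberately short recovery timeout so the transitions are observable:

```python
import time
from enum import Enum
from typing import Optional

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout_s: float = 0.05):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._failures = 0
        self._state = CircuitState.CLOSED
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> CircuitState:
        # An open circuit transitions to half-open once the recovery timeout elapses
        if (self._state == CircuitState.OPEN
                and time.monotonic() - self._opened_at > self.recovery_timeout_s):
            self._state = CircuitState.HALF_OPEN
        return self._state

    def record_success(self):
        self._failures = 0
        self._state = CircuitState.CLOSED

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._state = CircuitState.OPEN
            self._opened_at = time.monotonic()

    def allow_request(self) -> bool:
        return self.state != CircuitState.OPEN

# Simulate a provider outage and recovery
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout_s=0.05)
for _ in range(3):
    breaker.record_failure()
assert breaker.state == CircuitState.OPEN and not breaker.allow_request()

time.sleep(0.06)                                  # recovery timeout elapses
assert breaker.state == CircuitState.HALF_OPEN    # probe request now allowed
breaker.record_success()
assert breaker.state == CircuitState.CLOSED       # back to normal operation
```

The same cycle plays out in production, just with the 60-second default timeout instead of 50 milliseconds.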
Now plug this into your dispatch function. Before executing a call, check the breaker. After execution, record success or failure.
```python
def dispatch(prompt: str, max_latency_s: Optional[float] = None) -> dict:
    complexity = score_complexity(prompt)
    tier_order = ["budget", "mid", "premium"]
    start_tier = COMPLEXITY_TO_TIER[complexity.tier]

    for tier in tier_order[tier_order.index(start_tier):]:
        candidate = MODEL_REGISTRY[tier]
        breaker = get_breaker(candidate.provider)
        if not breaker.allow_request():
            continue  # Circuit is open -- skip this provider

        # Latency-aware step-up (mirrors select_model); premium stays the final fallback
        if tier != "premium" and max_latency_s is not None:
            p95 = latency_tracker.p95(candidate.name)
            if p95 is not None and p95 > max_latency_s:
                continue  # Current P95 too slow for this request's SLA

        start = time.monotonic()
        try:
            response = call_llm(candidate, prompt)  # Your actual API call
            latency = time.monotonic() - start
            latency_tracker.record(candidate.name, latency)
            breaker.record_success()
            return {"model": candidate.name, "response": response, "latency": latency}
        except Exception:
            breaker.record_failure()
            continue  # Try next tier

    raise RuntimeError("All providers exhausted")
```
This structure prevents your router from hammering a broken provider. Your application gets a fast failure response instead of waiting through multiple timeout-length retries in sequence.
Logging Every Routing Decision for Observability
A router without observability is a black box you can’t improve. Log every dispatch as a structured event — even the ones that seem routine.
```python
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_router")

@dataclass
class RoutingEvent:
    request_id: str
    timestamp: float
    prompt_tokens_est: int
    complexity_tier: str
    complexity_score: float
    complexity_signals: list[str]
    selected_model: str
    selected_provider: str
    selection_reason: str  # "complexity_match" | "latency_fallback" | "circuit_fallback"
    estimated_cost_usd: float
    actual_latency_s: float
    success: bool
    error: Optional[str] = None

def log_routing_event(event: RoutingEvent):
    logger.info(json.dumps({"event": "llm_routing_decision", **event.__dict__}))
```
The `selection_reason` field is the most valuable signal for tuning. If `circuit_fallback` dominates your logs, a provider is having reliability issues. If `latency_fallback` spikes, your P95 threshold may be too tight. If `complexity_match` is consistent, your scorer is working.
Route these logs to your observability stack (Datadog, Grafana, CloudWatch) and set alerts on `circuit_fallback` rate and `estimated_cost_usd` per hour. You’ll catch provider degradation before it impacts users.
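Tuning starts with aggregating those JSON lines. Here is a sketch of the kind of offline analysis you'd run over a day of logs (the field names match `RoutingEvent` above; the `summarize` helper itself is ours, not part of the router):

```python
import json
from collections import Counter

def summarize(log_lines: list[str]) -> dict:
    """Aggregate routing logs: selection-reason mix plus total estimated spend."""
    reasons: Counter = Counter()
    total_cost = 0.0
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "llm_routing_decision":
            continue  # ignore unrelated log lines
        reasons[event["selection_reason"]] += 1
        total_cost += event["estimated_cost_usd"]
    return {"reasons": dict(reasons), "total_cost_usd": round(total_cost, 4)}

logs = [
    json.dumps({"event": "llm_routing_decision",
                "selection_reason": "complexity_match", "estimated_cost_usd": 0.0003}),
    json.dumps({"event": "llm_routing_decision",
                "selection_reason": "circuit_fallback", "estimated_cost_usd": 0.0195}),
]
print(summarize(logs))
# → {'reasons': {'complexity_match': 1, 'circuit_fallback': 1}, 'total_cost_usd': 0.0198}
```

The same aggregation expressed as a Datadog or Grafana query is what you'd put behind the alerts.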
The Real Cost Math — What 70/20/10 Traffic Splitting Actually Saves
The “60–80% cost reduction” headline is everywhere, but the math behind it rarely gets shown. Here’s what the numbers look like.
Assumptions (per request):
- 300 input tokens + 200 output tokens = 500 tokens
- Budget: $0.05/M input + $0.08/M output → ~$0.000031/request
- Mid: $0.80/M input + $4.00/M output → ~$0.00104/request
- Premium: $15.00/M input + $75.00/M output → ~$0.0195/request
At 10,000 requests/day with a 70/20/10 split:
| Tier | Requests | Cost/req | Daily cost |
|---|---|---|---|
| Budget (70%) | 7,000 | $0.000031 | $0.22 |
| Mid (20%) | 2,000 | $0.00104 | $2.08 |
| Premium (10%) | 1,000 | $0.0195 | $19.50 |
| Total | | | $21.80 |
vs. 100% premium routing: 10,000 × $0.0195 = $195/day
That’s an 89% cost reduction. Even a conservative 50/30/20 split comes to ~$42.30/day, still a 78% reduction. Research confirms a 3-tier routing setup directing 70% of queries to budget, 20% to mid-tier, and 10% to premium reduces average per-query cost by 60–80% ([CloudIDR / PremAI, 2026](https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/)).
At 1M requests/day, the numbers scale linearly — and the routing logic that produces them took hours to build.
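You can sanity-check the blended math in a few lines, using the per-1K prices from the registry defined earlier (the 70/20/10 split is the assumption you'd replace with your own traffic mix):

```python
def per_request_cost(in_per_1k: float, out_per_1k: float,
                     input_tokens: int = 300, output_tokens: int = 200) -> float:
    """USD cost of one request, given per-1K-token prices."""
    return (input_tokens / 1000) * in_per_1k + (output_tokens / 1000) * out_per_1k

# Per-1K prices match the MODEL_REGISTRY entries above
costs = {
    "budget": per_request_cost(0.00005, 0.00008),
    "mid": per_request_cost(0.00080, 0.00400),
    "premium": per_request_cost(0.01500, 0.07500),
}
split = {"budget": 0.70, "mid": 0.20, "premium": 0.10}  # assumed traffic mix
requests_per_day = 10_000

routed = sum(requests_per_day * share * costs[tier] for tier, share in split.items())
all_premium = requests_per_day * costs["premium"]
print(f"routed: ${routed:.2f}/day, all-premium: ${all_premium:.2f}/day, "
      f"savings: {1 - routed / all_premium:.0%}")
# → routed: $21.80/day, all-premium: $195.00/day, savings: 89%
```

Change `requests_per_day` or the split and the savings percentage updates accordingly; the percentage depends only on the mix, not the volume.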
When to Ditch Your Custom Router for LiteLLM or RouteLLM
Custom routers give you complete control, but they come with maintenance costs. Here’s an honest look at when the tradeoff tips toward a framework.
Stick with your custom router if:
- You need sub-millisecond routing latency — agent loops where latency compounds on every sequential call
- Your routing logic uses domain-specific signals a general framework won’t support
- You need full observability without paying for a vendor dashboard
- Your team can own and test the routing code
Consider LiteLLM if:
- You want a drop-in proxy with 100+ provider integrations out of the box
- You need load balancing, caching, and rate limiting without building them yourself
- You can accept ~500µs per-request overhead — acceptable for single API calls, but it compounds in multi-step agent loops ([Maxim.ai, 2026](https://www.getmaxim.ai/articles/top-5-llm-router-solutions-in-2026/))
Consider RouteLLM if:
- You want ML-powered routing decisions rather than heuristic-based ones
- Quality vs. cost optimization is your primary goal — RouteLLM’s matrix factorization router achieved 85% cost reduction while maintaining 95% of GPT-4 performance on MT-Bench, requiring only 14% GPT-4 calls ([UC Berkeley / LMSYS, ICLR 2025](https://www.lmsys.org/blog/2024-07-01-routellm/))
- You have enough traffic volume to make model-based routing worth the added complexity
The honest recommendation: start with your own router. You’ll understand the problem domain deeply, your routing logic will be tight and debuggable, and you can always migrate to a framework once you’ve hit the ceiling of what heuristics can do.
Putting It All Together
The complete LLM router in Python — scorer, registry, latency tracker, circuit breaker, dispatch loop, and logging — fits in under 300 lines of pure Python. The components snap together in a single `dispatch()` call:
- Score the incoming prompt with `score_complexity()`
- Select a candidate model with `select_model()`
- Check the circuit breaker before dispatching
- Execute the call, record latency and outcome
- Log the `RoutingEvent` to your observability stack
Every routing decision is now explainable. Every provider failure is detected automatically. Every dollar of API cost is earned by a request that genuinely needed that tier.
Start with the complexity scorer and a two-tier registry (budget + premium) targeting your highest-volume endpoint. Instrument the logs, run it for a week, and let real traffic data tell you where to tune the thresholds. The 60% cost reduction isn’t a headline — it’s a calculation you can verify in your first billing cycle.