If you’re sending every LLM request to GPT-4 or Claude Opus, you’re almost certainly overpaying by 3–10×. The premium models are exceptional, but they cost $30–60 per million tokens while capable budget models cost $0.10–0.50 per million. That’s a 60–300× price gap for calls that often don’t need top-tier reasoning.
Building an LLM router in Python is how you stop paying premium prices for budget-level tasks. In this tutorial, you’ll build a production-ready router from scratch — no LangChain, no LiteLLM, no framework lock-in — that scores prompt complexity, selects models by cost and latency, fails over automatically with a circuit breaker, and logs every decision so you can tune it over time.
Why Your Single-Model Setup Is Costing You 3–10× More Than It Should
The cost differential between frontier and budget models isn’t marginal — it’s transformational. As of early 2026, GPT-4-class and Claude Opus models cost $30–60 per million tokens; small open models like Llama 3 or Mistral cost $0.10–0.50 per million tokens — a 60–300× spread ([CloudIDR, 2026](https://www.cloudidr.com/blog/llm-pricing-comparison-2026)).
At 100,000 requests per day, each averaging 500 tokens, that’s 50 million tokens/day. At GPT-4 pricing ($40/M), that’s $2,000/day. Route 70% of those requests to a $0.20/M budget model and the daily bill drops to roughly $607: $600 for the 30% still on GPT-4, plus about $7 for the 35M budget-tier tokens, without touching output quality for routine tasks.
The compounding gets worse in agent loops. If your agent makes 8 sequential LLM calls per user request and you’re using a premium model for all of them, every dollar of cost multiplies by 8. A router that sends only the final synthesis step to GPT-4 and intermediate steps to a budget model reclaims that multiplier immediately.
A 2025 analysis of 86,000 developers found that 40–60% of LLM API budgets go to operational inefficiencies — over-routing to premium models, no caching, no fallback logic ([PremAI, 2026](https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/)). The fix isn’t more budget — it’s smarter routing.
The Anatomy of a Production LLM Router in Python
Before writing any code, understand the three layers every production router needs:
- Complexity Scorer — evaluates each prompt and assigns a complexity tier (low / medium / high). This is the decision engine.
- Model Registry — a configuration of available models with their per-token costs, tier, and health status.
- Dispatch Engine — selects the right model using the complexity score, live latency data, cost constraints, and circuit breaker state. Executes the call and logs the outcome.
These three layers are loosely coupled by design. You can swap out the scorer, add a model to the registry, or change routing thresholds independently — without touching the other components.
Building a Complexity Scorer in Pure Python
You don’t need an ML model to score prompt complexity. A combination of keyword detection, prompt length, and structural signals covers the majority of routing decisions correctly — and it runs in microseconds.
```python
import re
from dataclasses import dataclass

COMPLEX_KEYWORDS = {
    "analyze", "compare", "explain why", "pros and cons", "tradeoffs",
    "step by step", "debug", "refactor", "write a function", "evaluate",
    "summarize", "synthesize", "argue", "critique",
}

SIMPLE_KEYWORDS = {
    "translate", "spell check", "convert", "list", "what is",
    "define", "extract", "yes or no", "true or false",
}

@dataclass
class ComplexityScore:
    tier: str            # "low", "medium", "high"
    score: float         # 0.0 to 1.0
    signals: list[str]

def score_complexity(prompt: str) -> ComplexityScore:
    signals = []
    score = 0.0
    lower = prompt.lower()

    # Signal 1: Keyword detection
    if any(kw in lower for kw in COMPLEX_KEYWORDS):
        score += 0.4
        signals.append("complex_keyword")
    if any(kw in lower for kw in SIMPLE_KEYWORDS):
        score -= 0.2
        signals.append("simple_keyword")

    # Signal 2: Prompt length
    token_estimate = len(prompt.split())
    if token_estimate > 500:
        score += 0.3
        signals.append("long_prompt")
    elif token_estimate < 50:
        score -= 0.1
        signals.append("short_prompt")

    # Signal 3: Structural complexity
    if "```" in prompt or re.search(r"\bdef \w+\b|\bclass \w+\b", prompt):
        score += 0.2
        signals.append("code_block")
    if prompt.count("?") > 2:
        score += 0.1
        signals.append("multi_question")

    score = max(0.0, min(1.0, score))
    tier = "high" if score >= 0.6 else "medium" if score >= 0.3 else "low"
    return ComplexityScore(tier=tier, score=score, signals=signals)
```
This scorer is intentionally transparent. Every decision is captured in `signals`, which feed directly into your routing logs. You can tune the thresholds after deployment once you see real traffic data — and you’ll have the signal names to guide you.
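One caveat on Signal 2: `len(prompt.split())` counts words, not tokens, and English prose runs closer to ~1.3 tokens per word. If you want a slightly better estimate without pulling in a tokenizer dependency, a common rule of thumb is about four characters per token. A minimal sketch (the `estimate_tokens` name is ours, not part of the scorer above):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    This is an approximation for English text; use a real tokenizer
    (e.g. tiktoken) if billing-accurate counts matter.
    """
    return max(1, len(text) // 4)

# A 400-character prompt estimates to roughly 100 tokens
print(estimate_tokens("x" * 400))  # → 100
```

Swap it in for `prompt.split()` if you'd rather tune the length thresholds in tokens instead of words.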
Defining Your Model Registry with Cost-Per-Token Pricing
The model registry is a plain Python dataclass that centralizes your model configuration. Define it once; reference it everywhere.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    name: str
    provider: str
    tier: str                  # "budget", "mid", "premium"
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens
    max_tokens: int
    timeout_s: float = 30.0

MODEL_REGISTRY = {
    "budget": ModelConfig(
        name="llama-3-8b",
        provider="groq",
        tier="budget",
        input_cost_per_1k=0.00005,
        output_cost_per_1k=0.00008,
        max_tokens=8192,
    ),
    "mid": ModelConfig(
        name="claude-haiku-3-5",
        provider="anthropic",
        tier="mid",
        input_cost_per_1k=0.00080,
        output_cost_per_1k=0.00400,
        max_tokens=200000,
    ),
    "premium": ModelConfig(
        name="claude-opus-4",
        provider="anthropic",
        tier="premium",
        input_cost_per_1k=0.01500,
        output_cost_per_1k=0.07500,
        max_tokens=200000,
    ),
}

COMPLEXITY_TO_TIER = {"low": "budget", "medium": "mid", "high": "premium"}
```
Two rules to follow here. First, keep `input_cost_per_1k` and `output_cost_per_1k` separate — output tokens are typically 3–5× more expensive than input tokens, and conflating them makes your cost estimates wrong. Second, version-pin your model names. When a provider releases a new model, you decide when to adopt it.
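Keeping the two prices separate pays off immediately: per-request cost estimation becomes a two-line function. Here is a sketch against a trimmed-down `ModelConfig` (the `estimate_cost` helper is illustrative, not part of the registry code above):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens

def estimate_cost(model: ModelConfig, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call, pricing input and output separately."""
    return ((input_tokens / 1000) * model.input_cost_per_1k
            + (output_tokens / 1000) * model.output_cost_per_1k)

premium = ModelConfig("claude-opus-4", input_cost_per_1k=0.015, output_cost_per_1k=0.075)
print(f"${estimate_cost(premium, 300, 200):.4f}")  # → $0.0195
```

Conflating the two rates into one blended number would misprice output-heavy calls (summarization, generation) by several multiples.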
Implementing Cost-Based Routing and Latency Tracking
With a complexity score and a model registry, the dispatch engine can now select a model. Complexity alone isn’t enough, though — you also need real-time latency tracking to avoid sending time-sensitive requests to a slow provider.
```python
import time
from collections import deque

class LatencyTracker:
    def __init__(self, window: int = 50):
        self._samples: dict[str, deque] = {}
        self.window = window

    def record(self, model_name: str, latency_s: float):
        if model_name not in self._samples:
            self._samples[model_name] = deque(maxlen=self.window)
        self._samples[model_name].append(latency_s)

    def p95(self, model_name: str) -> Optional[float]:
        samples = self._samples.get(model_name)
        if not samples or len(samples) < 5:
            return None
        sorted_s = sorted(samples)
        return sorted_s[int(len(sorted_s) * 0.95)]

latency_tracker = LatencyTracker()

def select_model(
    complexity: ComplexityScore,
    max_latency_s: Optional[float] = None,
) -> ModelConfig:
    preferred_tier = COMPLEXITY_TO_TIER[complexity.tier]
    tier_order = ["budget", "mid", "premium"]
    for tier in tier_order[tier_order.index(preferred_tier):]:
        model = MODEL_REGISTRY[tier]
        if max_latency_s is not None:
            p95 = latency_tracker.p95(model.name)
            if p95 is not None and p95 > max_latency_s:
                continue  # Too slow right now -- step up a tier
        return model
    return MODEL_REGISTRY["premium"]  # Final fallback
```
The `max_latency_s` parameter is the key addition. If a budget provider’s P95 latency is 8 seconds when your SLA requires 3, the router automatically steps up to the next tier rather than failing the request. A routing system with automatic latency-aware failover has demonstrated a 5× improvement in SLO attainment and 31.6% latency reduction versus single-provider deployments ([Machine Learning Plus, 2025](https://machinelearningplus.com/gen-ai/multi-provider-llm-router/)).
Adding Production-Grade Failover with a Circuit Breaker
A simple `try/except` retry loop is not a circuit breaker. When a provider is having a sustained outage, retrying every request wastes time and compounds latency — especially painful in agent loops. You need a state machine that learns from failure history.
The circuit breaker has three states:
- Closed — normal operation; requests pass through
- Open — provider is unhealthy; fail fast without attempting the call
- Half-open — provider might be recovering; send one probe request to check
```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._failures = 0
        self._state = CircuitState.CLOSED
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> CircuitState:
        if self._state == CircuitState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed > self.recovery_timeout_s:
                self._state = CircuitState.HALF_OPEN
        return self._state

    def record_success(self):
        self._failures = 0
        self._state = CircuitState.CLOSED

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._state = CircuitState.OPEN
            self._opened_at = time.monotonic()

    def allow_request(self) -> bool:
        return self.state != CircuitState.OPEN

circuit_breakers: dict[str, CircuitBreaker] = {}

def get_breaker(provider: str) -> CircuitBreaker:
    if provider not in circuit_breakers:
        circuit_breakers[provider] = CircuitBreaker()
    return circuit_breakers[provider]
```
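Before wiring the breaker into dispatch, it's worth watching the state machine complete a full cycle. The following is a condensed, self-contained copy of the breaker above, run through an outage and recovery with a deliberately short recovery timeout so the transitions are observable:

```python
import time
from enum import Enum
from typing import Optional

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout_s: float = 0.05):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._failures = 0
        self._state = CircuitState.CLOSED
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> CircuitState:
        # An open circuit transitions to half-open once the recovery timeout elapses
        if (self._state == CircuitState.OPEN
                and time.monotonic() - self._opened_at > self.recovery_timeout_s):
            self._state = CircuitState.HALF_OPEN
        return self._state

    def record_success(self):
        self._failures = 0
        self._state = CircuitState.CLOSED

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._state = CircuitState.OPEN
            self._opened_at = time.monotonic()

    def allow_request(self) -> bool:
        return self.state != CircuitState.OPEN

# Simulate a provider outage and recovery
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout_s=0.05)
for _ in range(3):
    breaker.record_failure()
assert breaker.state == CircuitState.OPEN and not breaker.allow_request()

time.sleep(0.06)                                  # recovery timeout elapses
assert breaker.state == CircuitState.HALF_OPEN    # probe request now allowed
breaker.record_success()
assert breaker.state == CircuitState.CLOSED       # back to normal operation
```

The same cycle plays out in production, just with the 60-second default timeout instead of 50 milliseconds.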
Now plug this into your dispatch function. Before executing a call, check the breaker. After execution, record success or failure.
```python
def dispatch(prompt: str, max_latency_s: Optional[float] = None) -> dict:
    complexity = score_complexity(prompt)
    tier_order = ["budget", "mid", "premium"]
    start_tier = COMPLEXITY_TO_TIER[complexity.tier]

    for tier in tier_order[tier_order.index(start_tier):]:
        candidate = MODEL_REGISTRY[tier]
        breaker = get_breaker(candidate.provider)
        if not breaker.allow_request():
            continue  # Circuit is open -- skip this provider

        # Latency-aware step-up (mirrors select_model); premium stays the final fallback
        if tier != "premium" and max_latency_s is not None:
            p95 = latency_tracker.p95(candidate.name)
            if p95 is not None and p95 > max_latency_s:
                continue  # Current P95 too slow for this request's SLA

        start = time.monotonic()
        try:
            response = call_llm(candidate, prompt)  # Your actual API call
            latency = time.monotonic() - start
            latency_tracker.record(candidate.name, latency)
            breaker.record_success()
            return {"model": candidate.name, "response": response, "latency": latency}
        except Exception:
            breaker.record_failure()
            continue  # Try next tier

    raise RuntimeError("All providers exhausted")
```
This structure prevents your router from hammering a broken provider. Your application gets a fast failure response instead of waiting through multiple timeout-length retries in sequence.
Logging Every Routing Decision for Observability
A router without observability is a black box you can’t improve. Log every dispatch as a structured event — even the ones that seem routine.
```python
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_router")

@dataclass
class RoutingEvent:
    request_id: str
    timestamp: float
    prompt_tokens_est: int
    complexity_tier: str
    complexity_score: float
    complexity_signals: list[str]
    selected_model: str
    selected_provider: str
    selection_reason: str  # "complexity_match" | "latency_fallback" | "circuit_fallback"
    estimated_cost_usd: float
    actual_latency_s: float
    success: bool
    error: Optional[str] = None

def log_routing_event(event: RoutingEvent):
    logger.info(json.dumps({"event": "llm_routing_decision", **event.__dict__}))
```
The `selection_reason` field is the most valuable signal for tuning. If `circuit_fallback` dominates your logs, a provider is having reliability issues. If `latency_fallback` spikes, your P95 threshold may be too tight. If `complexity_match` is consistent, your scorer is working.
Route these logs to your observability stack (Datadog, Grafana, CloudWatch) and set alerts on `circuit_fallback` rate and `estimated_cost_usd` per hour. You’ll catch provider degradation before it impacts users.
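Tuning starts with aggregating those JSON lines. Here is a sketch of the kind of offline analysis you'd run over a day of logs (the field names match `RoutingEvent` above; the `summarize` helper itself is ours, not part of the router):

```python
import json
from collections import Counter

def summarize(log_lines: list[str]) -> dict:
    """Aggregate routing logs: selection-reason mix plus total estimated spend."""
    reasons: Counter = Counter()
    total_cost = 0.0
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "llm_routing_decision":
            continue  # ignore unrelated log lines
        reasons[event["selection_reason"]] += 1
        total_cost += event["estimated_cost_usd"]
    return {"reasons": dict(reasons), "total_cost_usd": round(total_cost, 4)}

logs = [
    json.dumps({"event": "llm_routing_decision",
                "selection_reason": "complexity_match", "estimated_cost_usd": 0.0003}),
    json.dumps({"event": "llm_routing_decision",
                "selection_reason": "circuit_fallback", "estimated_cost_usd": 0.0195}),
]
print(summarize(logs))
# → {'reasons': {'complexity_match': 1, 'circuit_fallback': 1}, 'total_cost_usd': 0.0198}
```

The same aggregation expressed as a Datadog or Grafana query is what you'd put behind the alerts.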
The Real Cost Math — What 70/20/10 Traffic Splitting Actually Saves
The “60–80% cost reduction” headline is everywhere, but the math behind it rarely gets shown. Here’s what the numbers look like.
Assumptions (per request):
- 300 input tokens + 200 output tokens = 500 tokens
- Budget: $0.05/M input + $0.08/M output → ~$0.000031/request
- Mid: $0.80/M input + $4.00/M output → ~$0.00104/request
- Premium: $15.00/M input + $75.00/M output → ~$0.0195/request
At 10,000 requests/day with a 70/20/10 split:
| Tier | Requests | Cost/req | Daily cost |
|---|---|---|---|
| Budget (70%) | 7,000 | $0.000031 | $0.22 |
| Mid (20%) | 2,000 | $0.00104 | $2.08 |
| Premium (10%) | 1,000 | $0.0195 | $19.50 |
| Total | | | $21.80 |
vs. 100% premium routing: 10,000 × $0.0195 = $195/day
That’s an 89% cost reduction. Even a conservative 50/30/20 split comes to ~$42.30/day, still a 78% reduction. Research confirms a 3-tier routing setup directing 70% of queries to budget, 20% to mid-tier, and 10% to premium reduces average per-query cost by 60–80% ([CloudIDR / PremAI, 2026](https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/)).
At 1M requests/day, the numbers scale linearly — and the routing logic that produces them took hours to build.
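You can sanity-check the blended math in a few lines, using the per-1K prices from the registry defined earlier (the 70/20/10 split is the assumption you'd replace with your own traffic mix):

```python
def per_request_cost(in_per_1k: float, out_per_1k: float,
                     input_tokens: int = 300, output_tokens: int = 200) -> float:
    """USD cost of one request, given per-1K-token prices."""
    return (input_tokens / 1000) * in_per_1k + (output_tokens / 1000) * out_per_1k

# Per-1K prices match the MODEL_REGISTRY entries above
costs = {
    "budget": per_request_cost(0.00005, 0.00008),
    "mid": per_request_cost(0.00080, 0.00400),
    "premium": per_request_cost(0.01500, 0.07500),
}
split = {"budget": 0.70, "mid": 0.20, "premium": 0.10}  # assumed traffic mix
requests_per_day = 10_000

routed = sum(requests_per_day * share * costs[tier] for tier, share in split.items())
all_premium = requests_per_day * costs["premium"]
print(f"routed: ${routed:.2f}/day, all-premium: ${all_premium:.2f}/day, "
      f"savings: {1 - routed / all_premium:.0%}")
# → routed: $21.80/day, all-premium: $195.00/day, savings: 89%
```

Change `requests_per_day` or the split and the savings percentage updates accordingly; the percentage depends only on the mix, not the volume.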
When to Ditch Your Custom Router for LiteLLM or RouteLLM
Custom routers give you complete control, but they come with maintenance costs. Here’s an honest look at when the tradeoff tips toward a framework.
Stick with your custom router if:
- You need sub-millisecond routing latency — agent loops where latency compounds on every sequential call
- Your routing logic uses domain-specific signals a general framework won’t support
- You need full observability without paying for a vendor dashboard
- Your team can own and test the routing code
Consider LiteLLM if:
- You want a drop-in proxy with 100+ provider integrations out of the box
- You need load balancing, caching, and rate limiting without building them yourself
- You can accept ~500µs per-request overhead — acceptable for single API calls, but it compounds in multi-step agent loops ([Maxim.ai, 2026](https://www.getmaxim.ai/articles/top-5-llm-router-solutions-in-2026/))
Consider RouteLLM if:
- You want ML-powered routing decisions rather than heuristic-based ones
- Quality vs. cost optimization is your primary goal — RouteLLM’s matrix factorization router achieved 85% cost reduction while maintaining 95% of GPT-4 performance on MT-Bench, requiring only 14% GPT-4 calls ([UC Berkeley / LMSYS, ICLR 2025](https://www.lmsys.org/blog/2024-07-01-routellm/))
- You have enough traffic volume to make model-based routing worth the added complexity
The honest recommendation: start with your own router. You’ll understand the problem domain deeply, your routing logic will be tight and debuggable, and you can always migrate to a framework once you’ve hit the ceiling of what heuristics can do.
Putting It All Together
The complete LLM router in Python — scorer, registry, latency tracker, circuit breaker, dispatch loop, and logging — fits in under 300 lines of pure Python. The components snap together in a single `dispatch()` call:
- Score the incoming prompt with `score_complexity()`
- Select a candidate model with `select_model()`
- Check the circuit breaker before dispatching
- Execute the call, record latency and outcome
- Log the `RoutingEvent` to your observability stack
Every routing decision is now explainable. Every provider failure is detected automatically. Every dollar of API cost is earned by a request that genuinely needed that tier.
Start with the complexity scorer and a two-tier registry (budget + premium) targeting your highest-volume endpoint. Instrument the logs, run it for a week, and let real traffic data tell you where to tune the thresholds. The 60% cost reduction isn’t a headline — it’s a calculation you can verify in your first billing cycle.