If your MCP deployment has an authentication problem, an observability problem, and a rate-limiting problem all at once — you don’t have three problems. You have one missing infrastructure layer — an MCP gateway.
The Model Context Protocol scaled from roughly 100,000 SDK downloads per month at launch in November 2024 to 97 million per month by December 2025 — a 970× increase in 13 months. That growth curve means a lot of engineers are now running MCP servers that started as demos and became production workloads before anyone thought about governance. Auth bolted on after the fact. Logs that don’t tell you which agent called which tool. One runaway retry loop that can starve everything else.
An MCP gateway is the infrastructure layer that solves all three. Here’s exactly how it works, when you need it, and how to implement the hard parts without modifying a single existing MCP server.
Why Direct Agent-to-Server Connections Break in Production
Start with a diagram most teams draw early on: each agent connects directly to each MCP server it needs. Clean, simple, works in development.
Now add two more agents. Add three more servers. Add a new team onboarding next quarter.
What you have is an N×M mesh — N agents each maintaining direct connections to M servers. Every server has its own auth tokens. Every server logs to its own format. When a server’s endpoint changes, every agent that touches it breaks. When a tool call fails, you have no single place to look.
The numbers make this worse than it sounds. According to the LangChain State of Agent Engineering survey (n=1,340, November–December 2025), 66.4% of AI agent implementations now use multi-agent system designs, meaning a single user request can fan out to multiple MCP servers simultaneously. Salesforce’s 2026 Connectivity Benchmark found organizations currently run an average of 12 agents, projected to grow 67% within two years.
Do the math: 12 agents × 12 servers × (auth tokens + retry policies + logging configs per pair) = a governance surface area that nobody has time to maintain correctly.
A gateway replaces the mesh with a hub. Every agent talks to one endpoint. The gateway handles routing, identity, and observability centrally.
When a server moves, you update one config. When an audit comes, you have one log stream.
MCP Gateway vs. Reverse Proxy: What nginx and Caddy Cannot Do
The instinct to throw a reverse proxy in front of MCP servers is understandable — you probably already have nginx or Caddy running somewhere. The problem is that MCP is not a stateless HTTP API.
The capability gap is concrete:
| Concern | Reverse Proxy (nginx/Caddy) | MCP Gateway |
|---|---|---|
| TLS termination | ✅ | ✅ |
| Load balancing | ✅ | ✅ |
| Stateful session routing | ❌ | ✅ |
| OAuth 2.1 token enforcement | ❌ | ✅ |
| Per-agent identity | ❌ | ✅ |
| Tool-level RBAC | ❌ | ✅ |
| Token-based rate limits | ❌ | ✅ |
| Audit trails by agent | ❌ | ✅ |
| Trace context injection | ❌ | ✅ |
A reverse proxy handles transport-layer concerns. An MCP gateway handles AI-specific governance that traditional API gateways miss entirely. These are not the same class of problem, and conflating them is what leads teams to spend a sprint debugging dropped sessions before realizing their proxy was never designed for what MCP requires.
The Session Routing Problem: Why MCP’s Stateful Transport Breaks Standard Proxies
This is the most underappreciated operational challenge in MCP deployments, and almost nothing in the existing documentation calls it out clearly.
MCP’s Streamable HTTP transport issues Mcp-Session-Id headers. Once a session is established between an agent and an MCP server, every subsequent request in that session must route to the same server instance. This is sticky session routing, and nginx doesn’t do it by default for dynamic session identifiers in custom headers.
If you round-robin an agent’s requests across multiple server instances without sticky routing, the server on instance 2 has no context for the session started on instance 1. The session breaks. The agent throws errors. If your MCP server is stateful — and many are, holding tool execution context between calls — you’ve now corrupted that state.
A proper MCP gateway:
1. Extracts the Mcp-Session-Id from the request header
2. Maps it to the specific backend instance that owns that session
3. Routes consistently for the session lifetime
4. Handles session expiry and cleanup gracefully
You can configure nginx sticky routing with ip_hash, but that keys on client IP, not the session: an agent running on shared compute behind a rotating source IP will still break sessions. A consistent hash on the session header gets closer, yet it still cannot bind a brand-new session to the instance that issued it, and it reshuffles live sessions whenever a backend drops out of the pool. Gateway-native session management solves this at the protocol level; a standard reverse proxy never will.
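The four routing steps above can be sketched in a few lines. This is a minimal illustration, not a full gateway: backend names are placeholders, and a production implementation would learn the session-to-backend binding from the server's initialize response rather than assigning it on the request path.

```python
import itertools

class SessionRouter:
    """Sticky routing keyed on the Mcp-Session-Id header (illustrative sketch)."""

    def __init__(self, backends):
        self.backends = backends                # e.g. ["mcp-1", "mcp-2"]
        self._rr = itertools.cycle(backends)    # round-robin for session-less requests
        self.sessions = {}                      # session id -> owning backend

    def route(self, headers):
        session_id = headers.get("Mcp-Session-Id")
        if session_id is None:
            # No session yet (e.g. an initialize request): any backend may take it.
            return next(self._rr)
        backend = self.sessions.get(session_id)
        if backend is None:
            # Unknown session: bind it now. A real gateway would instead record
            # the binding when the server first issues the session id.
            backend = next(self._rr)
            self.sessions[session_id] = backend
        return backend

    def expire(self, session_id):
        # Step 4: drop the binding on session end (DELETE) or idle timeout.
        self.sessions.pop(session_id, None)
```

The essential property is that every request carrying the same Mcp-Session-Id lands on the same instance for the session's whole lifetime, regardless of the client's source IP.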
OAuth 2.1 Token Exchange at the Gateway Layer
The authentication picture for MCP is not flattering. According to Astrix Security’s State of MCP Server Security 2025, 53% of community MCP servers rely on static API keys. Zuplo’s State of MCP Report found 24% have no authentication at all. That same report found 50% of MCP builders cite security and access control as their top challenge, with 38% saying security concerns are actively blocking increased adoption.
The MCP spec formally added OAuth 2.1 on March 26, 2025, and the 2025-06-18 revision mandated RFC 8707 Resource Indicators to prevent token mis-redemption attacks across MCP server boundaries. But spec updates don’t retroactively secure deployed servers.
A gateway lets you enforce OAuth 2.1 at the edge without touching those existing servers. The RFC chain the gateway implements:
- RFC 9728 (Protected Resource Metadata) — the MCP server advertises its authorization server via /.well-known/oauth-protected-resource
- RFC 8414 (Authorization Server Metadata) — the AS advertises its endpoints via /.well-known/oauth-authorization-server
- RFC 7591 (Dynamic Client Registration) — agents self-register without manual provisioning
- Authorization Code + PKCE — the human-in-the-loop flow for interactive agents
- RFC 7662 (Token Introspection) — the gateway validates tokens on every request
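The gateway-side decision at the end of that chain can be sketched as a pure function over the RFC 7662 introspection response. This is an illustrative sketch: the resource URI and scope names are assumptions, and a real gateway would also cache introspection results and check token expiry.

```python
def authorize(introspection: dict, resource: str, required_scope: str) -> bool:
    """Gate a tool call on an RFC 7662 introspection response (sketch).

    `resource` is the MCP server's canonical URI; per RFC 8707 the token's
    audience must name it, so a token minted for one server cannot be
    replayed against another.
    """
    if not introspection.get("active"):
        return False                    # revoked, expired, or unknown token
    aud = introspection.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]                     # RFC 7662 allows a string or an array here
    if resource not in aud:
        return False                    # token was minted for a different server
    scopes = introspection.get("scope", "").split()
    return required_scope in scopes     # tool-level RBAC via scope
```

The same check applies to user-delegated and M2M tokens alike, which is what lets the gateway treat both flows uniformly downstream.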
The M2M gap nobody covers
Every OAuth 2.1 guide for MCP focuses on the authorization code + PKCE flow because it’s the reference flow in the spec. But fully autonomous agents — ones running in the background without a human at a browser — cannot complete a consent screen.
For these, you need the client credentials flow (M2M / service-to-service). The agent presents a client ID and secret — or a signed JWT assertion — directly to the token endpoint and receives a bearer token with no user interaction. The gateway enforces this token identically to user-delegated tokens via RFC 7662 introspection.
The practical implication: your gateway needs to handle both flows, and your token scopes need to be designed so M2M clients can be granted only the specific tool-level permissions they need — not blanket server access.
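A client credentials token request is a single form-encoded POST. The sketch below only builds the request; the endpoint URL, client identifiers, and scope names are all illustrative, and the RFC 8707 resource parameter is what binds the minted token to one specific MCP server.

```python
from urllib.parse import urlencode

def client_credentials_request(client_id, client_secret, token_url, resource, scopes):
    """Build the form body for an OAuth 2.1 client-credentials token request.

    Returns (url, headers, body); POSTing it yields a bearer token with no
    user interaction, which the gateway then validates via introspection
    exactly like a user-delegated token.
    """
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": " ".join(scopes),      # tool-level scopes, not blanket server access
        "resource": resource,           # RFC 8707: audience-bind the token
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return token_url, headers, body
```

Note the scope list: granting an M2M client only the specific tool-level scopes it needs is what makes the "no blanket access" policy enforceable at the gateway.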
Per-Agent Rate Limiting: Why Request Counts Are Not Enough
Standard rate limiting counts requests per time window. For a JSON REST API where every request is roughly the same size, this works fine.
MCP tool calls are not uniform. A search_files call with a small query pattern is trivial. A read_file call on a 200KB codebase file followed by an analyze_code call that sends full content to an LLM backend can consume as much compute as hundreds of smaller calls. One agent in a retry loop executing large-context tool calls can saturate your backend before the request counter trips.
Two-dimensional rate limiting solves this:
- Request dimension: max N tool calls per minute per agent identity
- Token dimension: max T input+output tokens consumed per minute per agent identity
The gateway tracks both dimensions from the response payload. When either limit is hit, it returns a 429 Too Many Requests with a Retry-After header before the request ever reaches your MCP server — protecting the server from traffic it never needs to see.
Token-based quotas also enable cost attribution per agent at the gateway layer. You know exactly which agent consumed which tools at what token cost, feeding directly into your observability pipeline and your chargebacks conversation with the finance team.
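A minimal sketch of the two-dimensional limiter, using fixed one-minute windows for brevity (a production gateway would use a sliding window or token bucket, and the limit values here are illustrative):

```python
import time

class TwoDimensionalLimiter:
    """Per-agent limits on both request count and token consumption (sketch)."""

    def __init__(self, max_requests=60, max_tokens=50_000):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.windows = {}   # agent id -> [window_start, request_count, token_count]

    def check(self, agent_id, now=None):
        """Return (allowed, retry_after_seconds). Call before forwarding."""
        now = time.monotonic() if now is None else now
        window = self.windows.setdefault(agent_id, [now, 0, 0])
        if now - window[0] >= 60:
            window[:] = [now, 0, 0]                 # start a fresh window
        if window[1] >= self.max_requests or window[2] >= self.max_tokens:
            return False, max(0.0, 60 - (now - window[0]))  # -> 429 + Retry-After
        window[1] += 1
        return True, 0.0

    def record_usage(self, agent_id, input_tokens, output_tokens):
        """Call after the response; token counts come from the response payload."""
        if agent_id in self.windows:
            self.windows[agent_id][2] += input_tokens + output_tokens
```

The token dimension is necessarily recorded after the response, so one oversized call can slip through; every subsequent call in the window is then blocked, which is what stops the runaway-retry-loop scenario.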
Adding Observability with Zero Server Changes
The default observability story for MCP is: nothing. The protocol has no native tracing. Unless your MCP server explicitly emits spans, you have no visibility into tool call latency, error rates, or which agent caused a spike.
The gateway fixes this without a single code change to your existing servers.
Trace context injection via params._meta
The MCP specification allows arbitrary metadata in the params._meta field of any tool call request. The gateway intercepts outgoing requests and injects W3C Trace Context headers (traceparent, tracestate) into params._meta before forwarding to the server.
If the server is instrumented, it picks up the trace context and continues the span tree. If it isn’t, the gateway still owns the root span and produces meaningful traces from its own perspective — capturing latency from agent to gateway to server and back, plus any errors the server returns.
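The injection itself is a small transformation on the JSON-RPC request. A sketch, with the caveat that a real gateway takes traceparent from its active OpenTelemetry span context rather than minting IDs by hand:

```python
import secrets

def inject_trace_context(request: dict, tracestate: str = "") -> dict:
    """Inject W3C Trace Context into a tools/call request's params._meta."""
    traceparent = "00-{}-{}-01".format(
        secrets.token_hex(16),   # 128-bit trace id
        secrets.token_hex(8),    # 64-bit parent span id
    )
    # _meta is spec-allowed arbitrary metadata, so no server change is needed.
    meta = request.setdefault("params", {}).setdefault("_meta", {})
    meta["traceparent"] = traceparent
    if tracestate:
        meta["tracestate"] = tracestate
    return request
```

Because _meta is additive, an uninstrumented server simply ignores the fields; an instrumented one continues the span tree from the injected traceparent.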
OpenTelemetry semantic conventions for MCP
The OTel community has defined MCP-specific semantic conventions that map cleanly onto standard RED metrics (Rate, Errors, Duration):
- gen_ai.tool.name — the tool being called
- mcp.method.name — the MCP method (tools/call, resources/read, etc.)
- gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token consumption per call
The OpenTelemetry spanmetrics connector automatically derives per-tool RED metrics from these spans without any custom metric code. You get a dashboard showing request rate, error rate, and p99 latency broken down by tool name and agent identity — from gateway instrumentation alone.
If you’re already using Langfuse for AI agent observability, these OTel spans feed directly into your existing trace pipeline.
Bifrost, the open-source Go-based MCP gateway, reports 11-microsecond median latency overhead. MCPX reports ~4ms p99. The observability layer adds no meaningful latency to tool calls.
Choosing a Gateway: A Decision Framework by Existing Infrastructure
The gateway ecosystem has fragmented quickly. Gartner projects 75% of API gateway vendors will integrate MCP-specific features by end of 2026, and most major vendors have already moved. Rather than comparing feature checklists, pick based on what you already run.
If you’re a Kong shop: Kong AI Gateway has native MCP support as a plugin layer. You keep existing Kong configuration patterns, rate limiting policies, and the Konnect dashboard. Least disruption.
If you’re on Kubernetes with Envoy: Envoy AI Gateway (CNCF sandbox) extends Envoy’s xDS config model with MCP-specific filters. Traefik Hub is a lighter-weight alternative in the same K8s-native category. If your team already manages Envoy configs, the mental model transfer is minimal.
If you’re on Azure: Azure API Management added MCP gateway support in 2025. If your agent workloads already run on Azure AI Foundry, keeping the gateway in the same plane simplifies IAM significantly.
If you’re greenfield or small-team: Agentgateway (Rust, open-source, CNCF sandbox), MCPX, and Bifrost are all production-ready with no existing gateway infrastructure required. Bifrost is the easiest to self-host if you want full control over the data plane.
If you need enterprise audit trails without buying new infrastructure: IBM ContextForge adds governance capabilities on top of existing IBM API Connect deployments. Apache APISIX is worth a look for teams already in that ecosystem — it has an active MCP plugin community and strong support for custom auth flows.
When You Don’t Need a Gateway Yet
Not every MCP deployment needs a gateway on day one.
Direct connections work well up to roughly 8–10 MCP servers or 5 developers sharing the deployment. Below that threshold, the operational overhead of running a gateway may not justify the complexity it adds to your stack.
The signal that you’ve crossed the threshold isn’t a number — it’s a question. The first time someone on your team asks “who called what tool, and when?” and you can’t answer it, you need a gateway. That moment also tends to coincide with the first incident you can’t attribute, the first cost spike you can’t explain, and the first time you need to restrict one team’s agent from calling a tool another team owns.
During active server development, direct connections make sense too. When you’re iterating on the MCP server implementation itself, adding a gateway in the middle of the stack slows feedback without adding value. Ship a production-ready MCP server first, add the gateway when you start adding agents.
Start Where the Pain Is
You don’t have to implement everything at once. If auth is blocking adoption, deploy the MCP gateway with OAuth 2.1 enforcement first — leave rate limiting and observability for the second sprint. If a runaway agent is the immediate problem, add the token-based rate limiter first.
Each capability — session routing, token exchange, rate limiting, trace injection — is independently deployable. The architecture here works incrementally: pick the capability that unblocks your current bottleneck, deploy it, and add the next one when the next problem surfaces.
There are now 17,468 public MCP servers across all registries. The infrastructure to govern them at production scale exists today. The question is whether you put it in place before or after the first incident.