Pydantic AI Agent Tutorial: Build for Production

Your production agent works fine locally. It hits every test case, the demo runs clean, and the LLM reliably returns the right data. Then you ship it — and three days later you’re staring at a KeyError buried six abstraction layers deep, trying to figure out why your agent returned "{'result': None}" as a string instead of an actual object.

This isn’t a fringe scenario — it’s the exact production gap a Pydantic AI agent framework solves. A 2025 AI developer survey found that 45% of developers who experiment with LangChain never use it in production, and 23% of teams that did ship with it eventually removed it entirely. The root cause is almost always the same thing: no type enforcement at the LLM boundary. Outputs are raw strings, parsing is manual, and failures surface late — far from where they originate.

Pydantic AI is built to fix this at the framework level. If you’ve used FastAPI, the mental model transfers immediately.

This tutorial builds a production-worthy agent from scratch: structured typed outputs, dependency injection for clean unit testing, real API tools with automatic argument validation, and local testing that burns zero API credits.

The Production Problem With LangChain Agents (and Why Type Safety Is the Fix)

The LangChain failure mode is specific and repeatable. Your chain returns an AIMessage. You call .content, get a string that looks like JSON, and call json.loads() on it — or rely on an output parser that guesses the schema. The LLM cooperates 94% of the time. The other 6% it returns valid JSON with a missing key, and your downstream code either silently fails or throws a KeyError that points nowhere near the actual problem.

No type-checker catches this at write time. No IDE warns you. The error surfaces in production, under real prompts, with real users.

LangChain’s abstraction layers compound the cost in measurable ways. Independent benchmarks found LangChain consumes 2.7× more tokens than direct API implementations for equivalent tasks, and its memory wrapper adds over one second of latency per call.

One team cut costs by roughly 30% after replacing LangChain’s default memory with a trimmed custom solution. These aren’t edge cases — they’re the natural consequence of wrapping everything in everything else.

Understanding why LangChain creates these production failure modes is the first step to choosing a better path. Pydantic AI’s answer is structural: use Python’s type system to enforce contracts at the LLM boundary, exactly the way FastAPI enforces them at the HTTP boundary. If the model returns something invalid, the framework asks it to retry before your code ever sees the response.

Pydantic AI Agent in 5 Minutes: The FastAPI Mental Model for LLM Agents

If you’ve written a FastAPI route, you already know this pattern:

@router.post("/search")
async def search(query: SearchQuery) -> SearchResult:
    ...

FastAPI reads the type hints, validates the request body against SearchQuery, and validates your return value against SearchResult. Invalid input raises a 422 before your handler runs. Wrong return type raises immediately.

Pydantic AI applies the same idea to LLM agents:

from pydantic import BaseModel
from pydantic_ai import Agent

class ResearchResult(BaseModel):
    title: str
    key_findings: list[str]
    confidence_score: float
    sources: list[str]

agent = Agent(
    'openai:gpt-4o',
    output_type=ResearchResult,
    system_prompt="You are a research assistant. Return structured findings.",
)

The full generic signature is Agent[Deps, OutputType]. The framework validates every LLM output against ResearchResult using Pydantic. If the model returns invalid data, it automatically re-prompts with the validation error as context — the reflection loop — before surfacing any failure to your application code.

The Pydantic team built this. The parent library has over 10 billion total downloads and 550 million monthly downloads — it’s the validation layer inside the OpenAI SDK, Anthropic SDK, Google ADK, and LangChain itself. That authority is structural, not claimed. Pydantic AI hit 16,400+ GitHub stars by April 2026, just 18 months after launch, with 8 million monthly downloads in 2025.

Before committing to a single model provider, it’s worth understanding how multi-model AI stacks behave under real workloads — Pydantic AI’s model abstraction makes swapping providers a one-line change, which matters when you’re optimizing cost vs. capability across agent tasks.

Structured Outputs: Replacing Fragile String Parsing With Validated Pydantic Models

The practical difference shows up the moment you call the agent:

result = agent.run_sync("Summarize recent findings on transformer attention mechanisms")

# Fully typed — no cast, no dict access, no guessing
print(result.output.confidence_score)   # float
print(result.output.key_findings[0])    # str
for source in result.output.sources:    # list[str]
    print(source)

Compare this to the LangChain pattern, where you’re on your own after the chain returns:

# LangChain: parsing is your problem
output = chain.invoke({"query": "Summarize transformer attention..."})
data = json.loads(output["text"])          # KeyError if key changed
score = float(data["confidence_score"])    # TypeError if the model returned null

The Pydantic AI version isn’t just more readable — it’s a different failure class entirely. Your type-checker catches schema mismatches at write time. Your IDE autocompletes result.output. with the actual field names.

The reflection loop is the quiet feature that makes this robust at scale. When the LLM returns invalid data, Pydantic AI feeds the full ValidationError details back to the model and requests a corrected response. In a controlled benchmark, this approach caught 23 bugs during development that would have reached production in LangChain, CrewAI, or AutoGen. Configure retries via retries=2 or retries=3 on the agent constructor for production use.

Dependency Injection With RunContext: Building Agents You Can Actually Unit Test

Most tutorials skip this feature. It’s also the one that makes production agents maintainable beyond the first week.

Define your dependencies as a dataclass:

from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class AgentDeps:
    db: DatabaseClient
    cache: RedisClient
    user_id: str

agent = Agent(
    'openai:gpt-4o',
    deps_type=AgentDeps,
    output_type=ResearchResult,
)

Tools receive a RunContext[AgentDeps] as their first argument. Dependencies are accessed through ctx.deps — no global imports, no module-level state, nothing to monkey-patch:

@agent.tool
async def fetch_user_context(ctx: RunContext[AgentDeps]) -> dict:
    """Retrieve the user's saved research topics and preferences."""
    prefs = await ctx.deps.db.get_user_prefs(ctx.deps.user_id)
    cached = await ctx.deps.cache.get(f"context:{ctx.deps.user_id}")
    return {"preferences": prefs, "recent_topics": cached or []}

For production, inject real clients. For tests, inject stubs:

# Production
deps = AgentDeps(
    db=PostgresClient(DATABASE_URL),
    cache=RedisClient(REDIS_URL),
    user_id=current_user.id
)

# Tests — no mocking library needed
deps = AgentDeps(
    db=FakeDatabase(seed_data=TEST_FIXTURES),
    cache=InMemoryCache(),
    user_id="test-user-123"
)

The agent doesn’t know or care which implementation it received, because the type contract is what matters. This is the same dependency injection pattern that makes FastAPI apps easy to test — Pydantic AI brings it to the agent layer where it’s been missing from every major framework.

Adding Tools: Letting Your Agent Call Real APIs With @agent.tool

The @agent.tool decorator works like @router.get in FastAPI. The function’s docstring becomes the tool description sent to the LLM. Pydantic validates every argument — if the model calls the tool with an unexpected type, it receives a validation error and retries.

@agent.tool
async def search_arxiv(
    ctx: RunContext[AgentDeps],
    query: str,
    max_results: int = 5,
    date_filter: str | None = None
) -> list[dict]:
    """
    Search arXiv for academic papers.

    Args:
        query: Search terms or paper title keywords.
        max_results: Number of results to return (1-20).
        date_filter: Optional ISO date to filter papers published after this date.
    """
    params = {"query": query, "max_results": max_results}
    if date_filter:
        params["start_date"] = date_filter
    return await arxiv_client.search(**params)

The docstring is not decoration — it’s the schema. Write it precisely and the LLM uses the tool correctly. Write it vaguely and you’ll see the model struggle with when and how to invoke it.

For tools that don’t need dependency access, @agent.tool_plain works identically but takes no RunContext argument. Use it for pure utility functions.

Testing Without Burning API Credits: TestModel, FunctionModel, and Pydantic Evals

This capability is almost entirely absent from beginner tutorials, and it may be the most production-relevant feature in the framework.

TestModel is a drop-in replacement that returns minimal valid responses without making any API calls. Use it to verify tool-calling logic, validation behavior, and dependency wiring:

from pydantic_ai.models.test import TestModel

def test_search_tool_is_invoked():
    with agent.override(model=TestModel()):
        result = agent.run_sync(
            "Find papers on attention mechanisms",
            deps=AgentDeps(
                db=FakeDatabase(),
                cache=InMemoryCache(),
                user_id="test-123"
            )
        )
    assert result.output is not None

FunctionModel is the power tool. You control exactly what the “LLM” returns, which means you can test every edge case in your validation and reflection logic:

from pydantic_ai.messages import ModelMessage, ModelResponse, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

def test_validation_failure_triggers_retry():
    call_count = 0

    def flaky_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            # invalid: confidence_score can't coerce to float
            args = {"title": "Test", "key_findings": [], "confidence_score": "not-a-float", "sources": []}
        else:
            args = {"title": "Test", "key_findings": [], "confidence_score": 0.9, "sources": []}
        # structured output arrives as a call to the result tool
        return ModelResponse(parts=[ToolCallPart('final_result', args)])

    with agent.override(model=FunctionModel(flaky_model)):
        result = agent.run_sync("test", deps=test_deps)

    assert call_count == 2  # reflection loop fired once
    assert result.output.confidence_score == 0.9

Combined with dependency injection, this gives you full agent test coverage that runs in milliseconds. No API credits, no network calls, no flaky tests that depend on model behavior.

In one independent benchmark, developer experience scored 8/10 for Pydantic AI versus 5/10 for LangChain. The testing story is a substantial part of that gap.

Observability With Logfire: Tracing Every LLM Call, Tool, and DB Query in Production

Getting an agent to work locally is one problem. Understanding why it failed at 3am in production is a different one entirely.

Logfire is Pydantic’s OpenTelemetry-based observability platform. The Pydantic AI integration is two lines:

import logfire

logfire.configure()               # reads LOGFIRE_TOKEN from env
logfire.instrument_pydantic_ai()  # done

From that point, every agent run creates a structured trace covering LLM calls with full prompt and response bodies, tool invocations with individual latency, validation failures with the exact error context, and any downstream services you instrument separately (asyncpg, httpx, SQLAlchemy all have first-party integrations).
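Wiring in those downstream integrations follows the same pattern. A sketch, assuming your tools use httpx and your database layer uses asyncpg:

```python
import logfire

logfire.configure()                # reads LOGFIRE_TOKEN from env
logfire.instrument_pydantic_ai()   # agent runs, LLM calls, tool invocations
logfire.instrument_httpx()         # outbound HTTP from your tools
logfire.instrument_asyncpg()       # database queries behind your deps
```

Each instrument_* call attaches OpenTelemetry spans to that library, so tool latency, HTTP calls, and queries all appear in the same trace as the agent run.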

The free tier includes 10 million spans per month with no credit card. For most teams in early production, that’s sufficient to establish performance baselines, catch output regressions, and debug failures that only appear under real user load.

The critical difference from log lines is that Logfire traces are queryable by structured attributes — tool_name, model, user_id, validation_error. When your Pydantic AI agent starts returning unexpected results in production, you filter by the relevant attribute and see the exact prompt-response cycle that caused it. That’s a completely different debugging experience from adding print statements inside your framework’s source code.

Given that 87% of developers report concerns about AI agent accuracy — and confidence in fully autonomous agents fell from 43% to 22% between 2024 and 2025 — structured observability is not an optional add-on. It’s the difference between a system you can reason about and one you’re guessing at.

Next Steps: Multi-Agent Patterns, Durable Execution, and Deploying to AWS Bedrock

The agent you’ve built here is a solid, testable, observable single-agent system. Where you go next depends on your use case.

Multi-agent workflows in Pydantic AI work by passing agents as tools to other agents. A coordinator agent receives user intent, delegates subtasks to specialized sub-agents (research, summarization, writing), and aggregates their typed outputs. Because every boundary is typed, schema mismatches surface at development time.

Teams building concurrent agent pipelines need isolation between parallel runs — git worktrees give you this cleanly during development so multiple agent instances can operate on the same codebase without branch conflicts.

Durable execution — for long-running workflows that need to survive process restarts — pairs Pydantic AI with Temporal or AWS Step Functions. The agent handles LLM interaction and validation; the orchestrator handles persistence and retry logic.

AWS Bedrock and other cloud providers are supported via Pydantic AI’s model abstraction. Swapping from openai:gpt-4o to bedrock:anthropic.claude-3-5-sonnet-20241022-v2:0 changes one string. Your tools, validation, and dependency injection stay identical.

The framework reached stable v1.x API in September 2025. The patterns you’ve built here won’t break on the next pip install. Start with one agent, one structured output type, and one injected dependency. Add Logfire when you deploy. That foundation handles more production load than you’d expect — and it won’t surprise you at 3am.
