Eval-Driven Development for AI Agents: Practical Guide

Your AI agent passes every spot check. You demo it to the team, it handles the happy path flawlessly, and you ship it. Three days later, a user reports it called the wrong tool on an ambiguous query — a query you never thought to test. Sound familiar? Eval-driven development for AI agents is how you break that cycle for good.

This guide gives you a concrete, code-first implementation using DeepEval and LangSmith. By the end, you’ll have a working eval framework you can scaffold into your project today, a CI/CD integration that catches regressions on every pull request, and a clearer mental model for thinking about agent quality systematically.

Why Traditional TDD Breaks for AI Agents (And What to Do Instead)

Traditional test-driven development rests on one critical assumption: given the same inputs, your code always produces the same outputs. Write a test, watch it fail, make it pass, never touch it again until the logic changes. That contract works beautifully for deterministic systems.

LLM agents break it completely.

The same prompt, sent twice, can produce different tool call sequences, different arguments, and different final answers — all while being “correct” by any reasonable definition. Worse, a new model version might score better on benchmarks while silently regressing on your specific use case. LLM performance on the SWE-bench Verified coding benchmark jumped from roughly 40% to over 80% in a single year ([Anthropic Engineering](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)). That pace of change is exciting and terrifying: without a systematic eval suite, every model upgrade is a gamble.

What you need instead is continuous, probabilistic evaluation — running your test suite repeatedly, tracking pass rates over a distribution of inputs, and treating a score of 85% on TaskCompletion as meaningful signal rather than demanding binary pass/fail as the only acceptable output.
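To make "tracking pass rates over a distribution of inputs" concrete, here is a minimal sketch of a repeated-run harness. `run_eval_case` is a hypothetical stand-in for executing your agent on one test case and grading the result; the random draw simulates agent nondeterminism:

```python
import random

def run_eval_case(case_id: str) -> bool:
    """Hypothetical stand-in for running the agent on one test case and
    grading the output. Real agents are nondeterministic, which we
    simulate here with a random draw that passes about 85% of the time."""
    return random.random() < 0.85

def pass_rate(case_id: str, runs: int = 10) -> float:
    """Run the same case repeatedly and return the fraction of passes —
    a probabilistic signal, not a single boolean."""
    return sum(run_eval_case(case_id) for _ in range(runs)) / runs

random.seed(0)  # fixed seed so this sketch is reproducible
rate = pass_rate("summarize-open-issues", runs=20)
print(f"pass rate: {rate:.0%}")
```

A pass rate of 0.85 over twenty runs is the kind of number you track release over release; a single green checkmark tells you almost nothing about an agent.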

What Is Eval-Driven Development? The Red-Green-Refactor Cycle for Agentic AI

Eval-Driven Development (EDD) is TDD translated into probabilistic terms. The workflow mirrors the classic Red-Green-Refactor cycle:

  1. Red — Write eval test cases before you write agent code. Define what success looks like: which tools should be called, in what order, with what arguments, and what the final answer should achieve.
  2. Green — Implement the agent. Run your evals. Iterate on your prompt, tool definitions, and model choice until your eval suite passes at an acceptable threshold.
  3. Refactor — When you upgrade a model, change a prompt, or add a new capability, re-run the suite and verify your pass rate held.

Anthropic explicitly recommends this approach as an official practice: “Build evals to define planned agent capabilities before the agent can fulfill them, then iterate until it passes” ([Anthropic Engineering](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)). Writing evals first forces you to articulate success criteria unambiguously — the exact discipline that prevents most spec-level bugs before a line of agent code exists.

The stakes are real. 32% of teams cite quality as the single biggest blocker to getting AI agents into production, ahead of latency (20%) and security ([LangChain State of Agent Engineering Report, 2025–2026](https://www.langchain.com/state-of-agent-engineering)). EDD directly addresses that blocker by making quality measurable from day one.

Choosing Your Eval Stack: How DeepEval and LangSmith Work Together

Before writing any code, it helps to understand what each tool does — and why you need both.

DeepEval is a code-first evaluation framework. It gives you pre-built metrics designed specifically for agentic workflows: did the agent call the right tools? Did it pass the right arguments? Did it complete the task? You write test cases in Python, run them with `deepeval test run`, and get a structured report. As of early 2026, DeepEval has surpassed 3 million monthly PyPI downloads and 10,000+ GitHub stars, making it the most-downloaded LLM evaluation framework ([Confident AI](https://www.braintrust.dev/articles/deepeval-alternatives-2026)).

LangSmith handles observability, tracing, and dataset management. It captures every step of your agent’s execution — which tools were called, what the inputs and outputs were, how long each step took — and surfaces that data in a queryable UI. Critically, it lets you promote real production failures into your eval dataset with a few clicks.

The combination is powerful: LangSmith catches what’s failing in production; DeepEval measures whether your fixes work. One handles the “what happened,” the other handles “was it good enough.”

Step 1 — Write Your Evals Before You Write Agent Code

Start by defining your test cases. Resist the urge to write the agent first and figure out tests afterward — that approach guarantees you’ll write tests that confirm what you built rather than validate what you intended.

For a baseline dataset, start with 20–50 examples drawn from real failures — not synthetic edge cases invented in a planning doc. Mine your existing systems: failed support tickets, ambiguous queries from user research sessions, edge cases your team has argued about in Slack. Real failures expose the actual distribution of inputs your agent will face.

Each test case should specify:

  • Input — the user query or task
  • Expected tools — which tools the agent should invoke, in what order
  • Expected arguments — what parameters those tools should receive
  • Success criteria — what the final output should achieve (semantically, not word-for-word)

Here’s a minimal test case structure in DeepEval:

```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Find all open issues assigned to me and summarize them",
    actual_output="...",  # filled in by your agent during the test run
    expected_tools=[
        ToolCall(name="list_issues", input_parameters={"assignee": "me", "state": "open"}),
        ToolCall(name="summarize_text"),
    ],
)
```

Notice there’s no agent code yet. Writing this test case first forces a real decision: should `list_issues` receive `assignee="me"` or should the agent resolve the user’s identity first? Answering that question before building the agent saves hours of debugging later.

Step 2 — Instrument Your Agent with DeepEval

Once your test cases exist, you build the agent and attach DeepEval’s metrics. The four core agent metrics are:

  • `ToolCorrectnessMetric` — Did the agent invoke the right tools?
  • `ArgumentCorrectnessMetric` — Did it pass the right arguments to those tools?
  • `TaskCompletionMetric` — Did it complete the intended task?
  • `PlanQualityMetric` — Was the overall plan (sequence of steps) reasonable and efficient?

Here’s how a full test looks wired up with the `@log_hyperparameters` decorator for tracing:

```python
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import (
    ToolCorrectnessMetric,
    ArgumentCorrectnessMetric,
    TaskCompletionMetric,
)

@deepeval.log_hyperparameters(model="gpt-4o", prompt_version="v2")
def test_issue_summary_agent():
    result = run_agent("Find all open issues assigned to me and summarize them")

    test_case = LLMTestCase(
        input="Find all open issues assigned to me and summarize them",
        actual_output=result.final_answer,
        tools_called=result.tool_calls,
        expected_tools=[
            ToolCall(name="list_issues", input_parameters={"assignee": "me", "state": "open"}),
            ToolCall(name="summarize_text"),
        ],
    )

    assert_test(test_case, [
        ToolCorrectnessMetric(threshold=0.8),
        ArgumentCorrectnessMetric(threshold=0.7),
        TaskCompletionMetric(threshold=0.8),
    ])
```

Run this with `pytest` or `deepeval test run`. You’ll get a score for each metric, a breakdown of which tool calls were wrong, and — because you logged hyperparameters — a record of which model version and prompt produced those results. That record is what makes regressions traceable across model upgrades.

Choosing the right reliability metric: pass@k vs pass^k

For reliability testing — running the same input multiple times to assess consistency — your choice of metric matters:

  • pass^k (all-or-nothing): the agent must succeed on all k runs. Use this for customer-facing flows where a single failure is unacceptable — booking a flight, executing a payment, sending an email on someone’s behalf.
  • pass@k (at-least-one): the agent passes if it succeeds on at least one of k runs. Use this when testing capability (can the agent ever solve this?) rather than reliability.

Conflating these two is a common mistake. An agent that scores 95% on pass@k might be failing 40% of individual runs — catastrophic for a customer-facing workflow, fine for a research assistant.
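The two metrics are easy to confuse in code, so here is a minimal sketch of the distinction (plain Python, no library assumed), applied to the boolean outcomes of k repeated runs of the same input:

```python
def pass_at_k(results: list[bool]) -> bool:
    """pass@k — capability check: succeeds if at least one of k runs passed."""
    return any(results)

def pass_hat_k(results: list[bool]) -> bool:
    """pass^k — reliability check: succeeds only if all k runs passed."""
    return all(results)

# Five runs of the same input against the agent: four passes, one failure.
runs = [True, True, False, True, True]

print(pass_at_k(runs))   # True — the agent *can* solve this task
print(pass_hat_k(runs))  # False — it does not solve it *every* time
```

The same run history passes one metric and fails the other, which is exactly why a customer-facing flow graded on pass@k can look healthy while failing users regularly.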

Step 3 — Add LangSmith Tracing to Capture Failures as New Test Cases

DeepEval tells you how your agent scores against known test cases. LangSmith tracing tells you what’s failing in production that you haven’t written test cases for yet.

Instrument your agent with LangSmith’s tracing decorator:

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable(name="issue-summary-agent")
def run_agent(user_query: str) -> AgentResult:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_query}],
        tools=your_tools,
    )
    return process_response(response)
```

With tracing active, every production run streams to LangSmith automatically. From the UI, you can filter runs by failure tags, click “Add to Dataset” on any traced run to capture its exact input and output as a new test case, and set up automated evaluators that score production traces in real time.

The feedback loop looks like this: a user reports a failure → you find the trace in LangSmith → you add it to your DeepEval dataset → you write a failing test → you fix the agent → the test passes → you ship.

Only 52.4% of AI engineering teams run offline evaluations on test sets, yet 89% have implemented observability ([LangChain State of Agent Engineering Report](https://www.langchain.com/state-of-agent-engineering)). That gap means most teams already have the data they need to improve their agents — they’re just not converting it into test cases. EDD bridges exactly that gap.

Step 4 — Integrate Your Eval Suite into CI/CD

An eval suite you run manually once a sprint is not eval-driven development — it’s periodic testing. The practice only becomes continuous when your evals run automatically on every pull request and model change.

Here’s a minimal GitHub Actions workflow:

```yaml
name: Agent Eval Suite

on:
  pull_request:
    branches: [main]
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install deepeval langsmith pytest

      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: deepeval test run tests/evals/

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: .deepeval/
```

A few practical notes on running evals in CI:

  • Set threshold budgets, not binary pass/fail. A `ToolCorrectnessMetric` threshold of 0.75 on your baseline is reasonable. If a PR drops it below 0.60, block the merge.
  • Run on a fixed test set. Don’t randomly sample in CI — you want deterministic gate conditions. Reserve randomized adversarial testing for scheduled weekly runs.
  • Trigger on model config changes, not just code changes. If your model version or system prompt lives in a config file, add a path filter so that changing `config/models.yaml` also triggers the eval suite.
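The threshold-budget rule from the first bullet can be sketched as a small gate function; the function name and three-way outcome are illustrative, mirroring the 0.75 baseline and 0.60 floor above:

```python
def gate(current: float, baseline: float = 0.75, floor: float = 0.60) -> str:
    """Map an eval score to a CI decision under a threshold budget."""
    if current < floor:
        return "block"  # hard regression: fail the CI job
    if current < baseline:
        return "warn"   # soft dip: merge allowed, flag for review
    return "pass"

print(gate(0.80))  # pass
print(gate(0.70))  # warn — below baseline but above the floor
print(gate(0.55))  # block
```

The "warn" band is the point: probabilistic metrics drift run to run, and a budget absorbs that noise without letting a genuine regression through.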

With this setup, every engineer gets immediate feedback when their changes break agent behavior — not three days later when a user files a bug report.

Common Pitfalls and How to Avoid Them

Even teams that adopt EDD often make mistakes in how they build their eval suites. Here are the ones that cause the most damage.

One-sided eval sets. If your test cases only cover the happy path, you’ll get false confidence. Deliberately include adversarial inputs: ambiguous queries, missing context, conflicting instructions. Aim for at least 30% of your dataset to be “hard” cases.

Graders that don’t catch failures. A team’s eval score on CORE-Bench jumped from 42% to 95% simply by fixing bugs in the evaluation harness itself ([Anthropic Engineering](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)). If your `TaskCompletionMetric` threshold is set too low, or your expected tool definitions are underspecified, the metric will pass on wrong answers. Audit your graders as carefully as you audit your agent code.

Evaluating path instead of outcome. Requiring an exact tool call sequence makes tests brittle and can mask genuinely good solutions that take a different route. Use `PlanQualityMetric` to assess whether the approach was reasonable — not to demand a specific sequence. Reserve strict path assertions for deterministic sub-tasks where there is only one correct sequence.

Vague task specifications. “Summarize the issues” is not a test case. “Return a bullet-point summary of all open issues assigned to the authenticated user, including issue ID and priority, sorted by priority descending” is. The more precise your specification, the more useful your eval results — and the harder it is for your agent to appear to pass while doing something wrong.

Normalizing the absence of evals. According to the same LangChain report, 22.8% of teams with agents already in production still run no formal evaluations. Don’t benchmark yourself against that. Benchmark against the teams that catch regressions before their users do.

Start Building Your Eval-Driven Development Practice Today

Eval-driven development for AI agents isn’t a theoretical ideal — it’s a practical discipline you can implement this week.

Start with 20 real failure cases. Write DeepEval test cases that define what “correct” looks like before you touch the agent. Wire up LangSmith tracing to capture the failures you haven’t anticipated yet. Add a CI step so your next pull request runs the suite automatically.

The teams shipping reliable agents aren’t doing more manual testing. They’re the ones who automated their quality signal early — and let that signal drive every decision from model selection to prompt iteration.

Pick one agent in your system. Write three test cases for it before you change another line of code. That’s your red. Now make it green.
