Your AI coding agent isn’t the problem. Your inputs are.
That’s the uncomfortable truth behind a striking paradox: 84% of developers now use or plan to use AI coding tools (Stack Overflow 2025 Developer Survey), yet trust in AI-generated code has collapsed from 40% to 29% in a single year — with only 3% of developers saying they highly trust what their agent produces. Meanwhile, 66% of developers report spending more time fixing “almost-right” AI-generated code than it would have taken to write it themselves.
The gap between AI’s potential and AI’s actual output is a specification engineering problem. Developers who treat AI coding agents as conversational chat partners get conversational-quality code. Developers who treat them as structured, context-driven systems get production-grade output. This post shows you how to make that shift — with before/after examples, a reusable anatomy, and copy-paste templates you can drop into any agent workflow today.
The Real Reason Your AI Code Keeps Missing the Mark (It’s Not the Model)
When Qodo asked developers what frustrated them most about AI coding tools, 54% said the AI still misses relevance even after they manually select context. But here’s the number that matters: that frustration drops to 16% when context is persistently stored and reused across sessions.
The implication is direct. The model isn’t the bottleneck — the context layer is.
CodeRabbit’s 2025 research quantifies the downstream damage: AI-generated pull requests contain approximately 1.7x more issues than human-written PRs. Logic and correctness issues are up 75%. Security vulnerabilities are up 1.5–2x. Code readability problems have increased more than 3x.
These aren’t model failures. They’re specification failures — the AI made reasonable guesses about constraints and requirements that weren’t stated, and got them wrong.
A GitHub analysis of 2,500+ agent instruction files found that most fail for a single reason: insufficient specificity about inputs, outputs, and constraints. The model does exactly what it’s told. The problem is that most developers aren’t telling it enough.
Prompt Engineering vs. Specification Engineering — What’s Different
Prompt engineering is a conversational skill. It’s about crafting the right words in the moment — clever phrasing, chain-of-thought tricks, few-shot examples in a single exchange. It’s also ephemeral: it degrades across sessions, doesn’t transfer to teammates, and produces non-deterministic results because every conversation starts from scratch.
Specification engineering is a different discipline entirely. It’s the structured practice of authoring machine-readable project context — persistent, versioned documents that live in your repository and travel with every agent session. Where prompt engineering optimizes a single exchange, specification engineering defines the ground truth for every interaction.
Think of it this way: a skilled contractor doesn’t succeed because they’re good at interpreting vague instructions on the fly. They succeed because they start with blueprints.
“Most agent instruction files fail because they are too vague — the most common failure mode is insufficient specificity about inputs, outputs, and constraints.” — GitHub Spec Kit research, cited by Addy Osmani
The shift from prompt engineering to specification engineering is the defining skill change of the AI-coding era. It’s not a writing exercise. It’s a technical discipline where spec quality is the primary variable that determines code quality.
Before vs. After: How Spec Quality Directly Determines Code Quality
The most direct proof is a side-by-side comparison. Here’s the same feature request — a user authentication endpoint — written two ways.
The vague prompt (what most developers send)
```
Add a login endpoint to the API that takes a username and password.
```
What the AI produces: a POST `/login` route that accepts a JSON body, queries a hardcoded User model, and returns a JWT. It skips rate limiting, brute-force protection, and audit logging entirely. It looks plausible. It will not survive a security review.
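Concretely, the output tends to look like this hypothetical sketch (names and values are illustrative, not actual generated code):

```typescript
// Hypothetical sketch of the handler a vague prompt tends to produce.
// It looks plausible, but note what the prompt never asked for and so
// never appeared: rate limiting, uniform errors, audit logging.
type User = { email: string; password: string };

// A hardcoded in-memory "model" -- a common agent shortcut
const users: User[] = [{ email: "a@example.com", password: "hunter22" }];

function naiveLogin(
  email: string,
  password: string
): { token: string } | { error: string } {
  const user = users.find((u) => u.email === email);
  if (!user) return { error: "user not found" }; // leaks account existence
  if (user.password !== password) {
    return { error: "bad password" }; // plain-text comparison, distinct message
  }
  return { token: "jwt-" + email }; // stand-in for a real signed JWT
}
```

The distinct error messages alone are enough to fail a security review: an attacker can enumerate which emails are registered.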
The structured spec
```
Feature: User Authentication Endpoint
Goal: Add a POST /auth/login endpoint that authenticates registered users
  and returns a short-lived JWT.
System boundary: Auth service only. No changes to user registration.
Data contract:
  Input: { email: string (RFC 5321), password: string (8–72 chars) }
  Success (200): { token: string (JWT, 15min expiry),
                   refresh_token: string (7-day, httpOnly cookie) }
  Failure (401): { error: "invalid_credentials" }
    — same message whether email or password is wrong
Constraints:
- Rate limit: 5 attempts per IP per minute (use existing RateLimiter middleware)
- Passwords must be verified with bcrypt — never plain comparison
- Log every attempt (success and failure) to audit_log with timestamp and IP
Acceptance criteria:
- 401 on wrong email; 401 on wrong password — identical response, no enumeration
- 429 after 5 failed attempts in 60 seconds from the same IP
- Token validates correctly against /auth/verify
- Audit log row exists for every call
```
The AI-generated code from the structured spec includes rate-limiting middleware, bcrypt comparison, identical error responses for failed email and password (preventing enumeration attacks), and audit logging. It’s production-grade on the first pass — not after three rounds of review comments.
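For contrast, here is a framework-agnostic sketch of the shape that code takes. This is an illustration under assumptions, not actual generated output: `verify` stands in for bcrypt comparison, `audit` for the audit-log writer, and the JWT is a placeholder.

```typescript
// Illustrative sketch of a spec-shaped login handler. Dependencies are
// injected so the behavior the spec demands -- rate limiting, a single
// uniform 401, logging every attempt -- is explicit and testable.
type LoginDeps = {
  verify: (email: string, password: string) => boolean; // stands in for bcrypt
  audit: (entry: { email: string; ok: boolean; ip: string; at: number }) => void;
};

type LoginResult =
  | { status: 200; token: string }
  | { status: 401 | 429; error: string };

const WINDOW_MS = 60_000;
const MAX_ATTEMPTS = 5;
const failures = new Map<string, number[]>(); // ip -> recent failure timestamps

function login(
  email: string,
  password: string,
  ip: string,
  deps: LoginDeps
): LoginResult {
  const now = Date.now();
  const recent = (failures.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);

  if (recent.length >= MAX_ATTEMPTS) {
    deps.audit({ email, ok: false, ip, at: now }); // every attempt is logged
    return { status: 429, error: "too_many_attempts" };
  }

  const ok = deps.verify(email, password);
  deps.audit({ email, ok, ip, at: now });

  if (!ok) {
    recent.push(now);
    failures.set(ip, recent);
    // Same message for wrong email and wrong password -- no enumeration
    return { status: 401, error: "invalid_credentials" };
  }
  failures.set(ip, recent);
  return { status: 200, token: "signed-jwt-placeholder" };
}
```

Every one of those behaviors maps back to a line in the spec, which is exactly what makes the output verifiable.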
That gap is what CodeRabbit’s 1.7x defect-rate finding is measuring. The model isn’t less capable. The model had less to work with.
The Anatomy of an AI-Ready Spec — Six Components Every Agent Needs
Not every component belongs in every spec. But understanding what each one does helps you know when to include it.
1. High-level intent
One or two sentences. What is this feature trying to accomplish, and for whom? Agents use this to disambiguate edge cases — when behavior isn’t explicitly specified, intent acts as a tiebreaker.
2. System boundaries
What is in scope? What is explicitly out of scope? Without boundaries, agents make unsanctioned architectural decisions — adding a caching layer you didn’t want, or touching a module you didn’t intend.
3. Data contracts
Precise input and output shapes: field names, types, validation rules, and example values. This is the highest-leverage section in any spec. Qodo’s research found that 26% of developers cite “improved contextual understanding” as their most-wanted AI fix — data contracts are the fastest way to deliver that.
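As a sketch, the login contract above can be made machine-checkable in a few lines even without a validation library (the `@` check below is a crude stand-in for full RFC 5321 validation):

```typescript
// Sketch: the spec's input contract expressed as a type plus a runtime
// guard. Field names and rules mirror the login spec; no validation
// library is assumed.
type LoginInput = { email: string; password: string };

function parseLoginInput(body: unknown): LoginInput | null {
  if (typeof body !== "object" || body === null) return null;
  const { email, password } = body as Record<string, unknown>;
  if (typeof email !== "string" || !email.includes("@")) return null; // crude email check
  if (typeof password !== "string" || password.length < 8 || password.length > 72) {
    return null; // 8-72 chars, per the contract
  }
  return { email, password };
}
```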
4. Behavioral constraints
Performance budgets, required libraries, security requirements, architectural patterns to follow. Constraints prevent the plausible-but-wrong implementation that passes at a glance but breaks under load or audit.
5. Error-handling rules
Explicit error states, expected HTTP status codes, user-facing messages, and logging requirements. Agents left to infer error behavior default to generic patterns that are almost never right for your domain.
6. Acceptance criteria
Testable conditions the implementation must satisfy. Written as a checklist or as Given/When/Then statements, these become the agent’s definition of done — and what you verify before approving any AI-generated PR.
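For example, the rate-limit criterion from the login spec reads naturally in Given/When/Then form:

```markdown
Given a registered user and a single client IP
When POST /auth/login fails 5 times within 60 seconds and is called a 6th time
Then the 6th call returns 429, and every attempt has a row in audit_log
```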
Persistent Context Files: Writing CLAUDE.md and AGENTS.md That Work
A per-feature spec is powerful. A persistent context file that applies to every session is transformational.
Tools like Claude Code, Cursor, and Gemini CLI look for project-level context files — `CLAUDE.md`, `AGENTS.md`, `GEMINI.md` — and load them automatically at the start of every session. Most articles treat these as a footnote. They are the architectural foundation of specification engineering.
A well-written `CLAUDE.md` gives the agent standing instructions it doesn’t need to be told every time:
```markdown
# Project: Acme API

## Stack
- Runtime: Node.js 22, TypeScript 5.4
- Framework: Fastify (not Express — do not suggest Express)
- ORM: Drizzle (not Prisma)
- Auth: JWT (15min) + refresh token (7-day, httpOnly cookie)
- Testing: Vitest + Supertest for integration

## Coding conventions
- Validate all handler inputs with Zod before processing
- All DB queries go through the repository layer — never raw SQL in handlers
- Errors must extend AppError (src/errors/AppError.ts)
- Every public function needs a JSDoc comment

## Always do
- Run `npm run typecheck` before declaring a task complete
- Use the audit logging utility (src/utils/audit.ts) for any auth-related action

## Never do
- Add dependencies without asking
- Modify migration files — create new ones
- Bypass the RateLimiter middleware
```
For teams, `AGENTS.md` serves the same role with added governance: which engineers own which modules, what approval is required before certain files are modified, and how cross-module changes should be coordinated. This is the spec layer that makes AI-assisted development work across more than one developer prompting the same agent.
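What that looks like varies by team, but a minimal sketch (module paths and team names are invented for illustration) might be:

```markdown
# AGENTS.md (team governance layer, illustrative sketch)

## Module ownership
- src/auth/** → security team; changes require their review
- src/billing/** → payments team

## Approval rules
- Never modify existing migration files; new migrations need a data-team reviewer
- New dependencies require explicit human approval before installation

## Cross-module changes
- Open a draft PR and tag the owners of every touched module before generating code
```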
The payoff is measurable. Context frustration drops from 54% to 16% when context is persistently stored and reused — exactly what these files accomplish.
Spec-Driven Development in Practice — The Four-Stage Workflow
Spec-driven development (SDD) treats the spec as the source of truth, not an afterthought created after the code already exists.
Stage 1: Specification
Write the feature spec using the six-component anatomy. This happens before any code is generated. Store it in your repo — conventionally in a `specs/` or `.agent/` directory — so it’s versioned alongside the code it describes.
Stage 2: Technical plan
Hand the spec to your AI agent and ask it to produce a technical plan: which files will change, which will be created, which dependencies are involved, and what the implementation sequence should be. Review and approve this plan before generation begins.
Stage 3: Task decomposition
Break the technical plan into discrete tasks — each small enough to produce a reviewable diff. Tools like GitHub’s Spec Kit, Kiro, and the BMAD Method all operationalize this stage. The practical constraint: no single task should generate more than 200–300 lines of new code.
Stage 4: Implementation and verification
Generate code one task at a time and verify each output against the spec’s acceptance criteria before moving to the next. Catching drift early is critical — don’t batch ten tasks and review them at the end.
The “waterfall objection” surfaces here regularly: isn’t writing a spec before coding waterfall with extra steps? No — because specs are living documents. When requirements change, update the spec first, then regenerate. Versioning specs alongside code means you always know whether the implementation reflects the current requirements, and you can diff spec changes the same way you diff code.
Copy-Paste Spec Templates for Three Common Scenarios
These templates are minimal but complete. Fill in the brackets and drop them into any AI coding agent.
REST Endpoint
```markdown
Spec: [HTTP Method] /[path]
Goal: [One sentence — what this endpoint enables for which user]
Auth: [Required role/token type] | None
Input:
- [field]: [type] — [validation rule]
Success response ([status]):
- [field]: [type]
Error responses:
- [status]: [condition] → { error: "[code]" }
Constraints:
- [Rate limiting / caching / middleware requirements]
- [Libraries to use or avoid]
Acceptance criteria:
- [ ] [Testable condition]
- [ ] [Edge case that must be handled]
- [ ] [Security or performance requirement]
```
UI Component
```markdown
Spec: [ComponentName]
Goal: [What the component does and where it lives in the UI]
Props:
- [propName]: [type] — [required/optional, default if optional]
States: [idle | loading | error | empty | populated]
Behavior:
- On [event]: [what happens]
- On error: [how errors surface to the user]
Constraints:
- Accessibility: [ARIA roles, keyboard nav requirements]
- Styling: [Tailwind / CSS modules / etc.]
- Do not fetch data inside this component — accept data as props
Acceptance criteria:
- [ ] Renders correctly in all states
- [ ] Keyboard navigable
- [ ] Matches design spec at [breakpoints]
```
Background Job
```markdown
Spec: [JobName]
Goal: [What this job accomplishes and why it runs on a schedule]
Trigger: [Cron expression] | [Event-driven trigger]
Input source: [Database query / queue message / API call]
Processing logic:
- [Step one]
- [Step two]
Output / side effects:
- [What gets written, published, or sent]
Error handling:
- On failure: [retry policy — max attempts, backoff]
- On partial failure: [checkpoint behavior]
- Alerting: [how failures surface]
Acceptance criteria:
- [ ] Idempotent — safe to re-run without duplicate side effects
- [ ] Processes [N] records per run without timeout
- [ ] Failure logged with job ID and record ID
```
When to Write a Spec and When a Prompt Is Enough — A Decision Framework
Specs have a real cost. A minimal spec written well takes 15–30 minutes. The goal isn’t to spec everything — it’s to spec the things where that investment compounds.
Write a spec when:
- The feature will go to production and be maintained
- The work spans multiple agent sessions or days
- More than one engineer will prompt agents on related work
- The feature touches security, payments, auth, or data integrity
- Bugs would be expensive to find after the fact
A conversational prompt is fine when:
- You’re exploring a new library or API to understand its shape
- The task is genuinely one-off — a migration script that runs once, a data fix
- You’re prototyping something that will be thrown away
- The feature is low-risk and purely additive with obvious inputs and outputs
The simplest signal: if you’d write a ticket for it, write a spec for it. If you’d handle it in a Slack message, a prompt is probably fine.
Conclusion
Specification engineering for AI coding isn’t about doing more work before writing code. It’s about front-loading the thinking that was previously scattered across prompt-tweaking sessions, review comments, and post-merge bug reports. With 42% of all committed code in 2025 already AI-assisted (Sonar, 2025/2026) and AI-generated PRs carrying 1.7x more defects than human-written ones, closing that gap doesn’t require better models. It requires better inputs.
Start with one spec. Pick the next feature you’re about to build, grab one of the three templates above, spend 20 minutes filling in the six-component anatomy, and compare the result to what you’d have gotten from a vague prompt. The difference will be obvious, and it will be repeatable.