DORA Metrics Break When AI Writes Half Your Code

Your deployment frequency is up. Lead time is down. The DORA dashboard is green. And somehow, your on-call rotation has never been worse.

If that sounds familiar, you’re not misreading your metrics — your metrics are misreading your team. The arrival of AI coding assistants like Copilot, Cursor, and Claude Code has introduced a quiet distortion into the classic DORA framework. Not because DORA is broken, but because it was designed to measure human engineers writing human code. When DORA metrics and AI coding tools mix without accounting for AI attribution, throughput climbs while stability quietly collapses — a productivity illusion that looks great in a board deck and terrible at 2am on a Saturday.

This post breaks down exactly how each DORA metric gets distorted, what three independent research programs found when they looked at the data, and which three concrete metrics you can instrument — today — to get an honest picture of your team’s real performance.

The DORA Dashboard Lie — What ‘Green’ Looks Like When AI Writes Half the Code

Imagine two teams. Team A merged 200 PRs last month. Team B merged 400. By every DORA measure, Team B looks like an elite performer — higher deployment frequency, compressed lead time. But if 60% of Team B’s PRs came from Copilot suggestions and AI-generated boilerplate, the comparison is meaningless.

You’re not evaluating two engineering teams. You’re evaluating one engineering team and one engineering team with a code generation engine running continuously in the background.

This isn’t hypothetical. GitHub’s own data puts AI-generated code at 46% of all committed code across its platform. Sonar’s 2025 Developer Survey puts it at 42% via self-reporting. Either way, roughly half the code landing in your repository didn’t originate from an engineer thinking through a problem — it came from a model predicting the next token.

DORA metrics don’t know the difference. They count a merged PR as a merged PR. And if you’ve been trying to make sense of AI coding ROI in your existing dashboards, this is exactly why that number keeps feeling wrong.

How Each Classic DORA Metric Gets Distorted (And Why It’s Not DORA’s Fault)

The four classic DORA metrics each have a specific failure mode under AI. Understanding them individually is what lets you know which ones are still trustworthy and which need a denominator they don’t currently have.

Deployment Frequency gets inflated by sheer PR volume. AI tools make it trivially easy to open PRs for small, incremental changes. Teams using agentic AI — where autonomous agents open PRs, run tests, and trigger deploys without human initiation — can see deployment frequency numbers that would have been physically impossible two years ago. The metric can’t tell you whether those deployments represent meaningful progress or compounded technical debt landing in production.

Lead Time for Changes compresses upstream but jams downstream. AI makes the writing phase faster; it doesn’t make the review phase faster. Reviewers now face a higher volume of code they didn’t write and may not fully trust, creating a review bottleneck that DORA lead time calculations typically obscure by averaging across the full pipeline.

Change Failure Rate is the most dangerous distortion. AI-generated code passes AI-generated tests — both produced by the same model with the same blind spots. A test suite that an AI wrote to validate code an AI wrote will have high coverage and a clean CI run. It will also miss exactly the bug categories that AI models systematically miss.

Change failure rate looks stable. Production incidents don’t.

MTTR is actually the least distorted of the four. Recovery time is mostly a function of runbooks and incident response process, not code origin. The problem: while MTTR holds steady, incident frequency is soaring — and DORA doesn’t track incident frequency as a primary metric.

The Data Is In: Faros AI, GitClear, and the 2025 DORA Report All Found the Same Paradox

Three separate research programs reached the same uncomfortable conclusion from different directions.

Faros AI analyzed 22,000 developers across task management, version control, CI/CD, and incident management systems. The headline numbers from high-AI-adoption teams: PRs merged up 98%, bugs per developer up 54%, PR review time up 441%, incidents per PR up 242.7% — all simultaneously, on the same teams. The throughput looks elite. The stability picture is a quiet crisis.

GitClear analyzed 211 million changed lines of code across 2020–2024 and found code churn — lines revised within two weeks of authoring — nearly doubled from 5.5% to 7.9%. Copy/pasted code blocks increased 8x versus prior-year baselines. Neither metric appears anywhere on a standard DORA dashboard. They’re invisible distortions compounding quarter over quarter, steadily widening the gap between what your dashboard shows and what your engineers actually experience.

The DORA 2025 Report from Google Cloud, surveying ~5,000 technology professionals, confirmed the paradox at scale: higher AI adoption is statistically associated with higher software delivery instability. AI improves individual throughput by an estimated 2–18% while simultaneously increasing team-level instability. Critically, DORA’s own program responded by retiring its four-tier performance model (elite/high/medium/low) and introducing seven team archetypes, adding Rework Rate as a new core metric, and replacing MTTR with Failed Deployment Recovery Time.

The organization that created DORA is revising its own benchmarks in direct response to AI distortion. If they’re doing it, you should be too.

Metric 1 — AI Code Share: The Missing Context Layer That Makes Your DORA Numbers Readable Again

AI Code Share is conceptually simple: what percentage of merged code in a given period was AI-generated or AI-assisted? It’s a denominator. Without it, every other metric is uninterpretable.

Consider deployment frequency. If it doubles, is that because your engineers got more efficient, or because Copilot is autocompleting entire files? AI Code Share tells you.

If your change failure rate holds steady at 5% while AI Code Share climbs from 20% to 65%, you have a problem building under the surface — because AI-generated code at scale produces significantly more bugs. Aggregated research puts AI-generated code at 1.7x as many bugs as human-written code, with 75% more logic and correctness errors and 194 incidents per 100 PRs.

Tracking AI Code Share requires tagging. The most practical approach is commit metadata: GitHub Copilot exposes telemetry through its API; Claude Code and Cursor can be integrated with pre-commit hooks that tag commits with tool attribution. DX’s TrueThroughput product automates this for mixed-source teams. Faros AI’s GAINS framework provides attribution tracking at the PR level.
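As a minimal sketch of what the computation looks like once commits carry attribution, assuming an `AI-Assisted: true` commit trailer — a team convention your own hooks would establish, not a Git or Copilot standard:

```python
# Sketch: compute AI Code Share from commit messages that carry an
# "AI-Assisted: true" trailer. The trailer name is an assumed team
# convention, not a standard emitted by any AI tool.
def ai_code_share(commit_messages: list[str]) -> float:
    """Return the fraction of commits tagged as AI-assisted."""
    if not commit_messages:
        return 0.0
    tagged = sum(
        1 for msg in commit_messages
        if "ai-assisted: true" in msg.lower()
    )
    return tagged / len(commit_messages)

# In practice you would feed this from the output of
# `git log --since="30 days ago" --format=%B` for the measurement window.
```

A commit-count share is the coarsest version; weighting by lines changed per commit is a natural refinement once the tagging habit sticks.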

You don’t need perfect attribution. Even rough tagging — manual self-reporting in PR descriptions, Slack prompts after Copilot-heavy sessions — gives you directional signal that immediately makes your DORA numbers more interpretable.

Metric 2 — Code Durability: Catching Architectural Fragility Before It Becomes a 3am Incident

Code Durability measures the percentage of AI-generated code that survives 30 days unmodified. It’s a proxy for architectural quality that operates on a fundamentally different timescale than any classic DORA metric.

GitClear’s churn data is the clearest evidence for why this matters. When code churn doubles, engineers are spending an increasing proportion of their time rewriting recently-written code. That’s not iteration. That’s rework — the expensive kind that doesn’t show up in sprint boards but absolutely shows up in burnout and delayed roadmaps.

The mechanism is predictable. AI tools excel at generating code that solves the immediate problem but lacks context about the broader system. The code works, passes tests, gets merged.

Three weeks later, another engineer hits a design incompatibility or an edge case the model didn’t anticipate, and the code gets rewritten. Repeat indefinitely.

Tracking Code Durability requires a simple Git analysis script: for any commit tagged as AI-generated, check whether that code exists unmodified 30 days later. Git blame and diff history provide everything you need — no new tooling required. Export to your DORA dashboard as a new data series.
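A sketch of the scoring half of that script, assuming you can obtain the current `git blame` attribution for the lines an AI-tagged commit introduced (the plumbing that produces those inputs is noted in comments):

```python
# Sketch: Code Durability as "fraction of an AI-tagged commit's lines
# still attributed to that commit after 30 days". Lines rewritten since
# will blame to a newer commit SHA.
def durability(commit_sha: str, blame_shas: list[str]) -> float:
    """blame_shas: current blame attribution, one SHA per line that the
    AI-tagged commit originally touched."""
    if not blame_shas:
        return 0.0
    surviving = sum(1 for sha in blame_shas if sha == commit_sha)
    return surviving / len(blame_shas)

# To build blame_shas for real, per file the commit touched, something like:
#   git blame --line-porcelain <file>
# parsed for the leading SHA on each header line, restricted to the line
# ranges the tagged commit introduced (from `git show <sha> --unified=0`).
# Run weekly; only score commits at least 30 days old.
```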

Teams running parallel AI coding agents — where multiple autonomous agents produce code concurrently — need this metric more than anyone. The compounding churn risk from parallel AI output without durability tracking is the fastest path from “impressive throughput numbers” to a production meltdown that your DORA dashboard never saw coming.

Metric 3 — Complexity-Adjusted Throughput: Separating Real Velocity From AI Boilerplate Inflation

Raw throughput — PRs merged, lines shipped, story points closed — is the DORA metric most directly inflated by AI. Complexity-Adjusted Throughput reweights it.

The core idea: not all PRs are equal. A PR that adds five lines to a complex payment processing module and a PR that autocompletes a hundred-line CRUD endpoint both count as one PR in your dashboard. Counting them the same way tells you how busy your AI tools are, not how effectively your engineers are working.

Complexity weighting can be applied through cyclomatic complexity scores (available in most static analysis tools), PR size relative to repo complexity, or domain tagging — business logic versus configuration versus boilerplate. The exact formula matters less than consistency. The goal is a trending metric, not an absolute one. If Complexity-Adjusted Throughput is flat while raw throughput climbs, AI is generating more output while your engineers aren’t necessarily building harder or more valuable things.
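One way to sketch that reweighting, assuming per-file cyclomatic complexity scores are already available from your static analysis tooling and an illustrative baseline of 5 that you would calibrate per repository:

```python
# Sketch: Complexity-Adjusted Throughput. Each merged PR is weighted by
# the mean cyclomatic complexity of its changed files, normalized against
# a baseline so boilerplate (complexity ~1) counts for less than one PR.
# Both the baseline and the 2x cap are assumptions to tune, not standards.
def adjusted_throughput(prs: list[list[int]], baseline: float = 5.0) -> float:
    """prs: one list of per-file complexity scores per merged PR."""
    total = 0.0
    for file_scores in prs:
        if not file_scores:
            continue
        mean_complexity = sum(file_scores) / len(file_scores)
        total += min(mean_complexity / baseline, 2.0)  # cap outlier PRs
    return total
```

With this weighting, a hundred autocompleted CRUD PRs at complexity 1 contribute roughly 20 adjusted units, while twenty PRs in genuinely complex modules can contribute the same or more — which is exactly the divergence you want the metric to surface.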

This is also where multi-model AI stack performance in production becomes relevant. Different tools produce meaningfully different output profiles. Cursor tends toward completions in existing files; Claude Code tends toward larger agentic tasks. The complexity distribution of AI-generated PRs varies by tool, which means per-tool breakdowns in your complexity analysis will produce more accurate signals as your telemetry matures.

How to Implement All Three Without Replacing Your Existing DORA Stack

These three metrics are additions, not replacements. You keep deployment frequency, lead time, change failure rate, and MTTR — you add context.

Step 1: Add commit metadata tagging. Create a pre-commit hook or GitHub Action that prompts contributors to tag AI-assisted commits. Binary tagging (human vs. AI-assisted) gives you AI Code Share at the commit level immediately. GitHub Copilot’s API and Faros AI’s ingestion layer can automate this for Copilot-heavy teams.
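A sketch of such a hook, assuming an `AI_SESSION` environment variable that your tooling sets during AI-assisted work — both the variable and the trailer name are made-up conventions to agree on per team:

```python
#!/usr/bin/env python3
# Sketch of a commit-msg hook (.git/hooks/commit-msg) that appends an
# "AI-Assisted: true" trailer when an assumed AI_SESSION variable is set.
import os
import sys

TRAILER = "AI-Assisted: true"

def tag_message(message: str, ai_assisted: bool) -> str:
    """Append the trailer once, only for AI-assisted sessions."""
    if not ai_assisted or TRAILER in message:
        return message
    return message.rstrip("\n") + "\n\n" + TRAILER + "\n"

if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]  # git passes the commit message file path
    with open(path) as f:
        msg = f.read()
    with open(path, "w") as f:
        f.write(tag_message(msg, os.environ.get("AI_SESSION") == "1"))
```

The hook is idempotent, so amended commits don't accumulate duplicate trailers, and untagged commits pass through untouched.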

Step 2: Build a durability query. Most Git platforms support querying file history through the API. A weekly cron job that checks survival rates of AI-tagged code at the 30-day mark gives you Code Durability without new tooling investment. Export to your existing dashboard as a new data series.

Step 3: Add complexity scoring to your PR pipeline. Tools like CodeClimate, SonarQube, or the open-source lizard output cyclomatic complexity per file. Weight your PR merge count by the average complexity score of changed files. Run this as a weekly rollup alongside your raw throughput numbers.
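If you want to prototype the rollup before wiring up a real tool, a stdlib-only stand-in that approximates cyclomatic complexity by counting branch points in the AST might look like this — real tools such as lizard or SonarQube are more precise, so treat this as a placeholder score source:

```python
# Sketch: rough cyclomatic complexity proxy using only the stdlib ast
# module: 1 plus the number of branch points. A stand-in for a real
# complexity tool, useful only for prototyping the weekly rollup.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def approx_complexity(source: str) -> int:
    """Approximate cyclomatic complexity for one file's source text."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))
```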

Step 4: Set baselines before communicating upward. Run all three metrics for 60 days before presenting to leadership. You want trend lines, not snapshots. A single week where AI Code Share hits 70% tells you nothing; a quarter where it climbs from 30% to 65% while Code Durability drops tells you everything. One data point worth anchoring this conversation on: 31% of PRs are now merging with no review at all in high-AI-adoption teams, per Faros AI. If your PR approval policy doesn’t account for AI-generated code, that number is likely climbing on your team right now.

What ‘Elite’ Should Mean Now — Recalibrating DORA Benchmarks for AI-Augmented Teams

The original DORA elite thresholds — multiple deploys per day, lead time under one hour — were calibrated against teams of human engineers. They’re not wrong benchmarks. They’re benchmarks for a different population.

DORA’s 2025 evolution acknowledges this directly. Retiring the four-tier model and introducing team archetypes was a response to AI distorting throughput baselines. When every team now hits “elite” deployment frequency because autonomous agents are opening boilerplate PRs at volume, “elite” has lost its meaning.

A working redefinition for AI-augmented teams:

  • AI Code Share target: Under 50%, with mandatory architectural review for AI-generated PRs touching core business logic
  • Code Durability target: 80%+ of AI-generated code surviving 30 days unmodified — current churn data suggests most teams are closer to 60%
  • Complexity-Adjusted Throughput: Trending upward quarter over quarter, with raw throughput and adjusted throughput moving in the same direction — sustained divergence signals boilerplate inflation that will eventually show up as incidents

These thresholds are directional starting points. Your right baseline depends on domain, AI tool mix, and incident tolerance. The point is to have a framework that distinguishes a team genuinely improving from a team generating more code faster while technical debt accumulates invisibly.

Your DORA Metrics Aren’t Wrong — They’re Incomplete

Your DORA metrics and AI coding tools don’t have to be in conflict. Deployment frequency, lead time, change failure rate, and MTTR remain meaningful signals — when read with context. The problem is that without AI attribution, they’ve become incomplete signals dressed up as complete ones.

AI Code Share gives your existing DORA metrics the denominator they didn’t know they needed. Code Durability catches quality degradation before it becomes a 3am incident. Complexity-Adjusted Throughput separates genuine engineering progress from throughput inflation driven by autocomplete.

You don’t need to replace your dashboard. You need three new data series on the one you already have.

Start with commit tagging this sprint. Run the durability query next week. By next quarter, you’ll have the honest performance picture that your current DORA dashboard only pretends to show.
