AI Code Ratio Benchmarks: What Percentage of AI-Generated Code Is Healthy?

0-20% barely used, 20-60% healthy, 60-90% heavy, 90%+ rigorous review. Benchmark ranges for AI-to-manual code ratio — and the drift metric that matters more than the absolute number.

A VP of engineering pulls up a slide in the all-hands and asks the room: “We’re at 78% AI-generated code across the team. Is that good or bad?” The tech lead in the second row has no answer. Neither does anyone else. The number is real — pulled from a dashboard that parses commit metadata — but no benchmark exists to compare it to. Is 78% the high-water mark of a high-leverage team? Or the warning siren of a team that has stopped reading its own code?

This is the gap right now. Every team measuring AI usage is measuring it for the first time in their own data, with no industry reference point to anchor against. GitHub Copilot adoption surveys measure who has the tool installed. IDE telemetry counts tab-completes. Neither of those answers the question that actually matters to a tech lead or a VP: what fraction of the code we shipped this week was authored by the AI rather than the human?

That’s a different number. It’s measurable. And once you measure it, you need to know what range is healthy.

How AI Code Ratio Is Actually Calculated

The ratio is simple in concept and surprisingly hard to get right in practice. AI-authored lines divided by total lines, attributed at the commit level, then rolled up per developer, per team, per time window.
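
A minimal sketch of that rollup, assuming per-commit attribution is already in hand. The commit records, their ai_lines/total_lines fields, and the weekly bucketing below are illustrative, not a real Tandemu API:

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-commit records; in practice these come from the attribution
# pipeline described in the next paragraphs (OTEL metrics or commit trailers).
commits = [
    {"author": "dana", "week": date(2024, 6, 3), "ai_lines": 120, "total_lines": 150},
    {"author": "dana", "week": date(2024, 6, 3), "ai_lines": 10,  "total_lines": 90},
    {"author": "arun", "week": date(2024, 6, 3), "ai_lines": 300, "total_lines": 310},
]

def weekly_ratio(commits):
    """AI-authored lines / total lines, rolled up per developer per week."""
    ai, total = defaultdict(int), defaultdict(int)
    for c in commits:
        key = (c["author"], c["week"])
        ai[key] += c["ai_lines"]
        total[key] += c["total_lines"]
    return {k: ai[k] / total[k] for k in total if total[k] > 0}

for (author, week), ratio in weekly_ratio(commits).items():
    print(f"{author}, week of {week}: {ratio:.0%}")
```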

The hard part is attribution. Most tools that claim to measure “AI productivity” infer it from proxy signals — counting Copilot tab-completion events from IDE plugins, or logging which files were open while the assistant was running. Those signals are noisy. A developer can accept a Copilot suggestion and rewrite half of it before commit. A developer can write a function entirely by hand in a session where the assistant happened to be running. Inference-based attribution conflates “the tool was active” with “the tool wrote the code.”

First-party attribution reads the artifact instead of guessing at it. There are two signals worth reading. The first is OpenTelemetry: coding agents like Claude Code and OpenCode emit lines_of_code metrics tagged ai or manual as the assistant makes Edit and Write tool calls, aggregated into a per-task line count before the work ever reaches git. The second is the commit record: when Claude Code authors a commit, it adds a Co-Authored-By: Claude <noreply@anthropic.com> trailer that’s parseable as ground truth. Tandemu prefers the OTEL signal when available — that path works for both tools — and falls back to commit-trailer parsing for Claude Code when OTEL isn’t wired up. We don’t infer the ratio. We read it.
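
A minimal sketch of the commit-trailer fallback, assuming exactly the trailer text described above; the helper functions are illustrative, not Tandemu's pipeline:

```python
import subprocess

CLAUDE_TRAILER = "Co-Authored-By: Claude <noreply@anthropic.com>"

def commit_is_ai_coauthored(sha: str) -> bool:
    """Treat a commit as AI-authored when Claude Code's co-author trailer is present."""
    trailers = subprocess.run(
        ["git", "log", "-1", "--format=%(trailers)", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return CLAUDE_TRAILER in trailers

def commit_lines_added(sha: str) -> int:
    """Lines added by the commit, from git's numstat output ('-' rows are binary files)."""
    numstat = subprocess.run(
        ["git", "show", "--numstat", "--format=", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(
        int(line.split("\t")[0])
        for line in numstat.splitlines()
        if line.split("\t")[0].isdigit()
    )
```

Note that the trailer signal is commit-grained: the sketch counts every line in a co-authored commit as AI-authored, which is coarser than the per-task ai/manual line counts the OTEL path provides.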

This matters for the rest of the post because every benchmark below assumes you’re measuring at the artifact, not at the keystroke. A team running a survey-based or IDE-plugin-based number will see ratios that drift upward by 10–20 points just from instrumentation noise. If the number you’re comparing against this framework comes from inference, recalibrate the source first.

The taxonomy of attribution methods — inference vs. heuristic vs. first-party — deserves its own treatment, and we’ll cover it in a follow-up post. For this one, assume first-party.

The Four Tiers

Once attribution is clean, the per-developer (or per-team) AI ratio falls into one of four bands. These are the canonical tiers Tandemu publishes in its methodology docs, expanded here with what each tier should trigger in practice.

| Ratio | Label | Signal | What to do |
| --- | --- | --- | --- |
| 0–20% | Barely used | Tools deployed but not adopted; developers default to writing by hand even when the assistant is running | Audit prompts, run a workflow review, surface the friction blockers preventing adoption |
| 20–60% | Healthy mix | Implementation goes to AI; critical, architectural, and judgment-heavy code stays manual | Monitor drift, watch for module-level concentration, leave well enough alone |
| 60–90% | Heavy AI usage | AI is the default first draft; developer time shifts from writing toward verification and review | Tighten review on payments, auth, data-flow, and other high-stakes modules |
| 90%+ | Almost entirely AI-generated | Developer is editor, not author; original code production has moved to the assistant | Mandatory architectural review at the PR gate; flag for skill-atrophy risk over time |
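
The bands reduce to a straightforward lookup; the thresholds below come from the table, and the function name is illustrative:

```python
def ai_ratio_tier(ratio: float) -> str:
    """Map a per-developer or per-team AI ratio (0.0 to 1.0) onto the benchmark tiers."""
    if ratio < 0.20:
        return "barely used"
    if ratio < 0.60:
        return "healthy mix"
    if ratio < 0.90:
        return "heavy AI usage"
    return "almost entirely AI-generated"
```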

The temptation when looking at a table like this is to ask “what’s the right number?” and pick one. That’s the wrong question. The right number depends on what kind of code is being written.

A 70% AI ratio on a greenfield CRUD service is healthy. The patterns are well-trodden, the assistant has training data that maps cleanly to the work, and a human writing the same code by hand is just slower at producing the same output. A 70% AI ratio on a distributed systems core — consensus protocols, transaction coordinators, retry logic with subtle ordering constraints — is a code review red flag. That’s code where the assistant doesn’t have strong priors, where small mistakes have systemic consequences, and where the human is the source of architectural intent that doesn’t exist anywhere else.

Same number, opposite interpretations. Any answer to “what’s a healthy AI ratio?” that doesn’t account for what the code does is missing the actual signal. The bands above are the starting frame. Module class is the multiplier on top.

This is the nuance most published guidance can’t reach. A vendor selling an SEI (software engineering intelligence) dashboard wants to give a single benchmark because a single benchmark sells. The honest answer requires more dimensions, which is why it stays unanswered in most public writing about AI productivity.

AI Ratio Drift: The Metric That Matters More Than the Number

Here’s the framing shift that changes how you read the bands above: the absolute ratio is a lagging indicator. The change in the ratio is a leading one.

We call this AI ratio drift: the week-over-week change in per-developer AI ratio, signed and tracked over time. A team holding steady at 65% is healthier than a team accelerating from 30% to 80% in three weeks, even though the second team has a “better” number on any given day. The first team has reached a stable equilibrium. The second team is in motion, and motion in this metric usually has a cause worth understanding.
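
A minimal sketch of the drift calculation, assuming a weekly per-developer ratio series is already available (the data shape is an assumption, not a Tandemu schema):

```python
def weekly_drift(ratios: list[float]) -> list[float]:
    """Signed week-over-week change in AI ratio, in percentage points."""
    return [round((curr - prev) * 100, 1) for prev, curr in zip(ratios, ratios[1:])]

# A developer ramping from 30% to 80% over five weeks: the absolute number looks
# "better" every week, but the drift series is what flags the acceleration.
print(weekly_drift([0.30, 0.42, 0.55, 0.68, 0.80]))  # [12.0, 13.0, 13.0, 12.0]
```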

Three drift signatures show up consistently in the deployments we’ve watched:

  • Healthy ramp. A gradual rise of 1–3 points per week as developers get better at prompting, learn which tasks the assistant handles well, and stop fighting it on the rest. This is what adoption is supposed to look like. It plateaus around the team’s natural ceiling — usually somewhere in the 50–70% range — and stays there. No intervention needed.

  • Cliff jump. A sudden spike of 20+ points in a single sprint. There are two causes worth distinguishing. The benign one: a new tool deployed, a new model rolled out, a workflow change that legitimately shifted the split. Verify it and move on. The concerning one: a single developer offloading judgment, where the ratio spike correlates with a specific person and a specific module class they shouldn’t be auto-piloting through. That’s an intervention conversation, not a metric anomaly.

  • Inverted drift. The ratio falls week over week. Adoption is going backward. This almost never means developers stopped wanting to use the assistant. It usually means friction is winning — the tool is failing on the work they actually need to ship, so they’ve quietly reverted to manual. Cross-reference the friction map. If inverted drift correlates with rising friction severity in the same modules, the assistant has run out of context for that codebase and the team is paying for it.
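
As a rough triage pass, the three signatures above reduce to threshold checks on that drift series. The cutoffs below just restate the prose (1–3 points per week, 20+ points in a sprint, a sustained decline) and should be tuned to your own cadence; this is a sketch, not Tandemu's classifier:

```python
def drift_signature(weekly_points: list[float]) -> str:
    """Rough classification of a drift series (percentage points per week)."""
    if len(weekly_points) < 2:
        return "not enough history"
    if any(delta >= 20 for delta in weekly_points):
        return "cliff jump: verify a tool rollout vs. one developer auto-piloting"
    if sum(weekly_points) < 0:
        return "inverted drift: cross-reference the friction map"
    if all(0 <= delta <= 3 for delta in weekly_points):
        return "healthy ramp: no intervention needed"
    return "mixed: segment by module class before reading anything into it"
```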

What drift gives you that the absolute number can’t: a way to act before the dashboard tells you something is broken. A team that drops from 65% to 45% over a month is showing you the same problem as a team running daily standups and hearing “I just wrote it myself, the AI was being weird.” Drift is the same signal, measured passively, three weeks earlier.

This is also why a benchmark table is necessary but not sufficient. Two teams can both sit at 60% AI ratio. One has been stable at 60% for six months. The other arrived there last Tuesday from 25%. Those are different teams, with different review needs, and a static benchmark treats them identically. Drift is what lets the framework tell them apart.

Reading the Number Without Being Read By It

The objection that comes up every time this gets discussed in public is some version of: “We’re at 80% AI-generated code, are we over-relying on the tool?” The answer is almost always no, and the reason is that the framing of the question is wrong.

High AI ratio is not the failure mode. Unreviewed high AI ratio is the failure mode. A team that ships 90% AI-authored code and reviews every line of it as if a junior wrote it is fine. A team that ships 50% AI-authored code and waves it through because the test suite passed is the one accumulating the technical debt. The risk lives in the review discipline, not in the authorship split.

The other thing the absolute number hides: where the AI-authored code lives. Pair AI ratio with friction severity from the friction map and the dangerous combination becomes visible. High AI ratio plus low friction in payment-processing code means either the assistant has unusually strong priors for your payment stack, or — more often — the team has stopped thinking carefully about that code because the tests still pass. High AI ratio plus high friction in any module means the assistant is writing code the team is then spending hours debugging. Either pattern is worth a conversation. The single ratio number doesn’t tell you which one you have.

One last piece worth saying out loud, because it comes up every time leadership asks for AI productivity numbers: a commit-metadata-based AI ratio is privacy-respecting by construction. It measures the artifact, not the keystroke. The developer is not being watched. The git history is being read. That distinction matters for trust, and it matters for whether the team will tolerate the metric existing in the first place.

Start Measuring the Drift, Not the Number

The right AI ratio for your team isn’t a number you can pull from someone else’s dashboard. It’s a function of what your code does, how the team reviews it, and which direction the ratio is moving. The bands in this post are the frame. Drift is the signal inside the frame.

This is what we built Tandemu to measure: per-commit AI ratios from first-party attribution, segmented by module class, tracked as drift over time, and cross-referenced with friction severity so the number is never read in isolation. The canonical tier definitions live in the methodology docs; the friction half of the equation is in the friction detection post.

The right AI ratio isn’t a number. It’s a direction. Measure the drift.