The 30% Problem: Why AI Is Writing More Code and Shipping Less of It

CircleCI's data across 29 million CI workflows shows main branch success rates at a five-year low. Nearly one in three merges fails. The model isn't the variable. The architecture is.

There's a report out from CircleCI this week that nobody in the vibe coding crowd is going to like. It's based on 29 million CI workflows across thousands of real teams, and it tells a story that's almost exactly the opposite of what the AI productivity narrative has been selling.

Here's the headline number: main branch success rates just hit a five-year low. 70.8%. Which means that right now, nearly one in three attempts to merge AI-generated code into production is failing.

Let me say that differently. After two years of being told that AI will 10x developer productivity, the industry's actual delivery data shows that code is breaking at rates we haven't seen since before anyone was using these tools.

Something is very wrong with how teams are deploying AI for development. And I think I know what it is.

***

More Code, Less Software

The CircleCI numbers tell a specific story. Average throughput—the number of daily CI workflows run—went up 59% year over year. Teams are generating code faster than ever. Feature branches are busy. Everyone is vibing.

But the main branch tells a different story. The median team saw a 15% increase in feature branch throughput and a 7% decrease on main. The top 10% saw feature branches up nearly 50% and main branch activity essentially flat.

The code is being written. It's just not reaching production.

+59% CI workflow throughput YoY
−7% main branch activity (median)
70.8% main branch success rate (five-year low)
72 min mean time to recovery (+13% YoY)

The top 5% of teams managed to ship—main branch throughput up 26% alongside 85% feature branch growth. But they're one in twenty. The other nineteen teams built a lot of stuff that's sitting in review, failing CI, or getting rolled back.

And it's getting harder to fix when it breaks. Mean time to recovery on the main branch now sits at over 72 minutes on average, up 13% year over year. On feature branches—where you're supposed to catch problems early—MTTR jumped 25%.

The most brutal number in the report: financial services teams went from 111-minute MTTR in 2024 to 471 minutes in 2025. For regulated FinTech, that's not a performance problem. That's a compliance event waiting to happen. It's also consistent with a separate finding that nearly 45% of AI-generated code contains critical security vulnerabilities and ungoverned patterns—a number that should terrify any engineering leader shipping into a regulated environment.

***

The Model Isn't the Variable

Here's what the CircleCI report doesn't say directly but what the data implies: the performance gap isn't about which model you're using. It's about how you're using it.

The top 5% of teams aren't using better AI. They have better systems around their AI. The report's conclusion—“success in the AI era is no longer determined by how fast code can be written; the decisive factor is the ability to validate, integrate, and recover at scale”—is actually an architecture argument dressed up as a productivity argument.

I've been making a version of this argument for a while: the model is not the problem. The architecture is the problem.

When you inject AI into a development pipeline without governance—without structured constraints on what it can do, how it should do it, and why—you get exactly what CircleCI measured. Lots of output, poor integration, high failure rates, slow recovery. The model generates confidently and incorrectly because it has no stable reference point for your specific architecture decisions, your codebase conventions, your team's standards.

You're not getting 10x productivity. You're getting 10x volume with 30% failure rates.

The AI Productivity Paradox: more code written, less code shipped. The specification level is the variable: ungoverned code at 28% success, tuned specifications at 73%, governed specifications at 95%.

The specification level data makes this concrete. Teams with no specification—pure vibe coding—see 28% success rates. Worse than a coin flip. Teams with tuned specifications reach 73%. Teams with governed specifications—800+ enforced architectural rules—reach 95%. The jump from 28% to 95% isn't a different model. It's a different architecture around the same model.

***

The Attention Dilution Problem Nobody Talks About

There's a specific mechanism behind these failures that the CircleCI report doesn't address because it's upstream of where their measurement happens.

Most teams “governing” their AI development are doing something like this: they have a .cursorrules file, or a long system prompt, or a CLAUDE.md that's grown to 200 lines over six months. Every rule they've ever thought of is in there. “Use ES modules.” “Never touch the auth layer without asking.” “Services go in /src/services.” Eighty rules, all loaded into every context window, every session.

The problem is that the model reads all of it and weights all of it roughly equally. It doesn't know that the rule about database connection pooling is more important than the rule about file naming conventions when it's trying to debug a production incident at 2am. It's left to infer priority from a flat list of equally weighted instructions, and a flat list carries no priority signal.

That's not governance. That's noise.

What you get is exactly what CircleCI measured: code that looks compliant but fails integration. The model tried to honor 80 rules simultaneously and got most of them partially right, which in a complex codebase means the code fails when it hits the edge cases your static rules didn't anticipate.

The attention mechanism of every major model—Claude, GPT-4o, Gemini—is designed to process hierarchical, relevant context. When you inject a firehose of equally-weighted instructions, you're working against how these systems actually allocate reasoning capacity. The model has to spend its cognitive budget parsing governance boilerplate instead of thinking about your actual problem.

There's also a cost dimension here that gets ignored. Unstructured generation burns tokens at scale—$50K or more in wasted context for large teams running continuous AI workflows. Enforced standards cut that waste dramatically, down to roughly 400 tokens per execution when the injection is relevance-scored rather than dumped wholesale. That's not a rounding error. That's the difference between AI development being economically viable at scale or not.
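To make the mechanics concrete, here is a minimal sketch of relevance-scored injection versus a wholesale dump. Everything in it is hypothetical: the rule texts, the tag-overlap scoring, the ~4-characters-per-token estimate, and the 400-token budget are illustrative stand-ins, not details from the CircleCI report or any real tool.

```python
# Hypothetical sketch: relevance-scored rule injection vs. a wholesale dump.
# Rule texts, token estimates, and the scoring heuristic are illustrative.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

RULES = [
    {"id": "api-errors", "tags": {"api", "errors"},
     "text": "Return problem+json bodies from every API error path."},
    {"id": "db-pooling", "tags": {"database"},
     "text": "Acquire database connections from the shared pool only."},
    {"id": "file-naming", "tags": {"style"},
     "text": "Name files in kebab-case."},
    {"id": "auth-review", "tags": {"auth", "api"},
     "text": "Changes to the auth layer require a human review before merge."},
]

def select_rules(task_tags: set[str], rules=RULES, budget_tokens=400):
    """Keep only rules that overlap the task's tags, most relevant first,
    until the token budget is spent."""
    scored = sorted(rules, key=lambda r: -len(r["tags"] & task_tags))
    picked, used = [], 0
    for rule in scored:
        if not rule["tags"] & task_tags:
            continue  # zero relevance: leave it out of the context window
        cost = estimate_tokens(rule["text"])
        if used + cost > budget_tokens:
            break
        picked.append(rule)
        used += cost
    return picked, used

wholesale = sum(estimate_tokens(r["text"]) for r in RULES)
picked, used = select_rules({"api", "errors"})
print(f"wholesale dump: {wholesale} tokens; scored injection: {used} tokens")
```

The point isn't the scoring function, which a real system would do far better. It's that the selection happens per task, so the rules about API errors reach the context window and the file-naming rule stays out of the attention budget entirely.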

***

What the Top 5% Are Actually Doing

The CircleCI data shows the top performers share something: they can absorb AI-driven acceleration without degrading delivery quality. Higher throughput AND higher success rates. That's not magic—it's architecture.

The report describes it as “autonomous validation” in CI/CD terms—using build history signals to validate what matters rather than running static checks against everything. That's directionally right. But the problem starts earlier in the pipeline, before CI ever runs.

The teams winning with AI development have figured out something about context management:

Relevance beats completeness. You don't need every rule in every context window. You need the right rules for the current task. A developer working on a new API endpoint doesn't need to see your database migration standards. They need your API design patterns, your authentication conventions, your error handling approach. Everything else is noise that dilutes the attention budget for the actual work.

Standards need to be earned, not declared. Static rule files accumulate. They grow with every incident, every code review, every “we should never do this again” moment. But they never shrink. After six months your CLAUDE.md is a dumping ground for every architectural concern anyone has ever had, and none of it is prioritized. Patterns that have proven themselves through repeated usage should carry more weight than rules someone added after a bad Monday.
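The "earned, not declared" idea can be sketched as a registry where rules gain weight each time they survive review and decay when nobody validates them. The decay rate, the promotion threshold, and the class names here are invented knobs for illustration, not features of any real tool.

```python
# Hypothetical sketch: standards weighted by proven usage rather than
# declaration order. Decay rate and promotion threshold are invented knobs.

from dataclasses import dataclass

@dataclass
class Standard:
    rule: str
    weight: float = 1.0

class EarnedStandards:
    DECAY = 0.95        # every review cycle, unvalidated rules fade
    PROMOTE_AT = 3.0    # repeatedly validated rules become "governed"

    def __init__(self):
        self.standards: dict[str, Standard] = {}

    def declare(self, rule_id: str, rule: str):
        self.standards[rule_id] = Standard(rule)

    def record_success(self, rule_id: str):
        # A pattern that survived review and CI earns more weight.
        self.standards[rule_id].weight += 1.0

    def end_cycle(self):
        # Rules nobody has validated lately shrink instead of accumulating.
        for s in self.standards.values():
            s.weight *= self.DECAY

    def governed(self):
        return [rid for rid, s in self.standards.items()
                if s.weight >= self.PROMOTE_AT]

reg = EarnedStandards()
reg.declare("pooling", "Use the shared DB connection pool.")
reg.declare("bad-monday", "Never deploy on Mondays.")
for _ in range(3):
    reg.record_success("pooling")
    reg.end_cycle()
print(reg.governed())  # the proven pattern is promoted; the other fades
```

The design choice worth noting: the file can shrink. A rule added after a bad Monday that never gets exercised decays out of the governed set instead of sitting in the context window forever.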

The model needs deterministic structure around it. The most dangerous thing about LLM-driven development isn't the code quality—it's the auditability. When you let an LLM decide what the next step is, you lose the ability to explain the decision chain. You can't replay it. You can't tell the auditor what happened or why. Deterministic orchestration—LLMs as workers dispatched by code, not as orchestrators making routing decisions—is what lets you recover fast when things break. Because you know where to look.
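Deterministic orchestration with LLMs as workers can be sketched like this. The step names, the stubbed `llm_worker`, and the audit-log shape are all invented for illustration; the point is only that routing lives in code and every step lands in a replayable log.

```python
# Hypothetical sketch: deterministic orchestration. Routing lives in plain
# code; the "LLM" is a stub standing in for a model call. Step names and
# the audit-log shape are invented for illustration.

import time

def llm_worker(step: str, payload: str) -> str:
    # Stand-in for a real model call; deterministic here for illustration.
    return f"{step}:{payload}"

class Orchestrator:
    # The step order is fixed by code, never chosen by the model, so every
    # run follows the same decision chain and can be replayed after the fact.
    STEPS = ["draft", "apply_standards", "self_review"]

    def __init__(self):
        self.audit_log: list[dict] = []

    def run(self, task: str) -> str:
        result = task
        for step in self.STEPS:
            result = llm_worker(step, result)
            self.audit_log.append(
                {"ts": time.time(), "step": step, "output": result}
            )
        return result

orc = Orchestrator()
orc.run("add /health endpoint")
# Each step is attributable: you can tell an auditor exactly what ran, when.
print([entry["step"] for entry in orc.audit_log])
```

Inverting this, letting the model decide which step comes next, is exactly what destroys the replay property: the log would record what happened but not why, and recovery means archaeology instead of lookup.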

***

The Messy Middle Is Real

One finding in the CircleCI report that deserves more attention: the U-shaped performance curve by company size.

Very small teams (2–5 people) perform well. Large enterprises (1000+) perform well. The messy middle—21 to 50 person companies—has the worst MTTR in the dataset at 174 minutes, nearly 3x worse than both ends of the curve.

That's not a coincidence. Small teams have informal coordination—everyone knows the architecture, decisions happen in conversation, the AI is a tool in a context-rich environment. Large enterprises have formal governance—standards bodies, review processes, architecture review boards, compliance frameworks. They move slower but the AI operates inside a defined structure.

The messy middle has outgrown the informal coordination but hasn't built the formal governance yet. Their developers are using AI tools the same way the small teams do—with ambient context and good intentions—but at a scale where that approach breaks down. Nobody knows what the other squad shipped last week. Patterns diverge. Standards drift. When the AI generates code, it's generating against a context that's inconsistent across the org.

That's the problem that governance tooling solves. Not because you need bureaucracy. Because you need the institutional memory to travel with every developer's AI session—automatically, consistently, without manual overhead.

***

What This Means for 2026

The CircleCI report frames the solution as “autonomous validation” in the CI/CD layer. They're not wrong—faster feedback on failures is valuable. But that's fixing the problem after it's already been embedded in code that made it through a development session, a PR, and a review.

The leverage point is upstream: at the moment the AI is generating code, not after it's already generated the wrong thing.

The teams that are going to win in 2026 are the ones that treat AI governance as an architecture problem, not a tooling problem. The question isn't which model to use. It isn't whether to use Claude or GPT-4o or Gemini. The question is what system you're going to put around the model—what constraints travel with every session, how those constraints stay relevant as the task changes, how decisions get logged and made auditable, and how patterns that work get promoted into standards that every developer benefits from.

The 30% failure rate in the CircleCI data isn't a model quality problem. It's an architecture quality problem. The models are doing what they're told. The problem is nobody has figured out how to tell them the right things at the right time in the right level of detail.

That's a solvable problem. The specification level data proves it. 28% to 95% is not a model upgrade. It's a governance decision.

The model isn't the variable. The architecture is.

***

The Equilateral open standards library is available at glidecoding.com—62 standards, 800+ enforcement rules, free to inspect and use. For teams that have hit the CLAUDE.md ceiling, MindMeld automates dynamic standards injection across models and sessions.