Last week Anthropic released the third version of their Responsible Scaling Policy, the voluntary framework they use to manage catastrophic risks from their own AI systems. It’s a serious document from serious people, and I recommend reading it.
But buried in that candid assessment of what worked and what didn’t is a finding that every CTO deploying AI at scale should internalize:
The model maker couldn’t govern the model through the model.
What Anthropic Actually Admitted
The RSP v3 isn’t just a policy update. It’s a post-mortem. And Anthropic deserves credit for writing it honestly.
Their original theory of change had four pillars: internal forcing function, race to the top, industry consensus, and eventual government coordination. Two and a half years later, their assessment is that two of those four largely failed.
The consensus-building didn’t happen. The government coordination didn’t happen. And critically, they found that the closer they got to important capability thresholds, the harder it became to make a clear external case for action—because the evidence was ambiguous, the evaluations were immature, and the political environment had shifted.
So what did they do? They built more external infrastructure.
Risk Reports. Frontier Safety Roadmaps. External reviewers with unredacted access. Centralized records analyzed by AI for anomalous behavior. A “regulatory ladder” policy roadmap.
Notice what all of these have in common: they live outside the model.
Anthropic is not solving their governance problem by making Claude more aligned. They’re solving it by building systems, processes, and accountability structures that operate independently of what Claude thinks or does. The governance layer is external. It has to be.
The Principle They Proved
There’s a thesis I’ve been developing for the past two years, and the RSP v3 is its highest-profile validation yet:
Models execute. Systems govern. Authority must live outside the model.
This isn’t a philosophical position. It’s an engineering constraint. You cannot embed reliable governance inside the thing being governed. The model can be instructed, guided, fine-tuned, and aligned—but the governance infrastructure that defines what “correct” looks like, enforces it consistently, and audits compliance has to exist in a layer the model cannot modify.
Anthropic figured this out at the civilization scale. They’re now building the external governance infrastructure to match.
Most enterprises are still trying to solve the same problem by writing better prompts.
What This Means for Your AI Coding Deployment
Let’s bring this down from the existential to the practical.
Right now, your engineering teams are deploying AI coding tools at scale. Claude Code, GitHub Copilot, Cursor, Codex—pick your stack. Developers are interacting with these tools hundreds of times per day. The tools are producing code that goes into production.
What’s governing that?
If your answer is “our CLAUDE.md files” or “our system prompts” or “our code review process,” you’re doing the equivalent of Anthropic trying to manage ASL-4 risks by making Claude nicer. You’re putting governance inside the thing being governed.
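The inside-versus-outside distinction can be made concrete. Here is a minimal sketch (all names and patterns are hypothetical, chosen for illustration): a prompt instruction asks the model to comply, while an external gate checks whatever the model actually emits, in a layer the model cannot modify.

```python
import re

# Inside-the-model "governance": a request the model may or may not honor.
SYSTEM_PROMPT = "Never build SQL queries with f-string interpolation; use parameterized queries."

# Outside-the-model governance: a check that runs on whatever the model emits,
# regardless of what the prompt said or whether the model complied.
FORBIDDEN_PATTERNS = {
    "sql-interpolation": re.compile(r"execute\(\s*f[\"']"),
}

def external_gate(generated_code: str) -> list[str]:
    """Return policy violations found in model output. The model cannot
    modify this function, so enforcement is independent of model behavior."""
    return [name for name, pattern in FORBIDDEN_PATTERNS.items()
            if pattern.search(generated_code)]

violations = external_gate('cursor.execute(f"SELECT * FROM users WHERE id={uid}")')
```

The point of the sketch is the layering, not the regex: the prompt can be ignored, rewritten, or drowned out by context, but the gate runs on every output no matter what.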
The RSP v3 identifies three specific failure modes that Anthropic encountered at scale. Each has a direct enterprise parallel:
The zone of ambiguity. Anthropic found that model capabilities often approached thresholds without clearly crossing them—making it hard to know when to act. Enterprises face this daily: AI coding tools are producing code that looks right, passes review, and ships—until it doesn’t. You don’t have systematic visibility into whether your aggregate AI guidance is drifting toward or away from your architectural standards.
The correction loop doesn’t propagate. Anthropic learned things from ASL-3 implementation that should have informed industry-wide action. It didn’t, because there was no mechanism to make it happen. Your senior engineers are correcting AI coding mistakes constantly. Those corrections are improving local outcomes. They’re not becoming organizational knowledge.
Unilateral compliance becomes impossible at scale. Anthropic explicitly states that higher ASL requirements may be impossible for one company to achieve alone. The enterprise equivalent: one team’s well-governed AI coding workflow doesn’t protect you from another team’s ungoverned one. Governance has to be systemic or it isn’t governance.
The External Governance Stack
Anthropic’s response to these failure modes is instructive. They didn’t release RSP v3 and say “we’ve fixed Claude.” They released a framework for external accountability: published roadmaps, third-party reviewers, risk reports on a regular cadence, centralized audit records.
The governance stack is external, systematic, and independent of model behavior.
Your enterprise AI governance stack needs the same properties. It needs to:
Capture corrections at the point of occurrence—not after code review, not in a retrospective, but when the developer overrides the AI and knows why. That moment contains the organizational knowledge you need.
Promote corrections that represent genuine standards—not local preference, not one engineer’s opinion, but patterns that reflect architectural intent and should apply across the organization.
Enforce standards as invariants—not suggestions in a markdown file that an agent can ignore, but constraints that travel with every code generation event regardless of which team, which repo, or which tool is involved.
Audit and report systematically—the equivalent of Anthropic’s Risk Reports, applied to your AI coding infrastructure. What is the aggregate quality of the guidance being injected into your AI tools? Where are the gaps between current practice and architectural standards? What’s drifting?
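Taken together, the four properties above amount to a small pipeline: capture a correction at the moment it happens, promote it to an organizational standard, enforce it on every generation event, and audit what is drifting. A minimal sketch, with all class, method, and field names hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    """Captured at the moment a developer overrides the AI and knows why."""
    repo: str
    rule: str        # e.g. "use the shared retry decorator, not ad-hoc loops"
    rationale: str
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class GovernanceStack:
    def __init__(self) -> None:
        self.corrections: list[Correction] = []     # capture
        self.standards: set[str] = set()            # promote
        self.audit_log: list[tuple[str, str]] = []  # audit

    def capture(self, correction: Correction) -> None:
        """Record the override at the point of occurrence, not in a retro."""
        self.corrections.append(correction)

    def promote(self, rule: str) -> None:
        """Elevate a local correction to an org-wide standard."""
        self.standards.add(rule)

    def enforce(self, team: str) -> list[str]:
        """Standards travel with every generation event, for every team,
        every repo, every tool; each injection is logged for audit."""
        self.audit_log.append((team, "standards-injected"))
        return sorted(self.standards)

    def drift(self) -> list[str]:
        """Corrections that keep occurring but were never promoted:
        the gap between current practice and architectural intent."""
        return sorted({c.rule for c in self.corrections} - self.standards)
```

The `enforce` step is the invariant property from the list: it returns the same standards regardless of which team asks, and it leaves an audit trail, which is what makes the drift report possible.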
The Hardest Part of the RSP v3
The most honest line in the entire document is this: Anthropic considered defining higher ASL safeguards in ways that made compliance easy to achieve—and chose not to, because it would undermine the intended spirit of the policy.
That’s a company choosing rigor over comfort. Choosing to acknowledge the hard problem rather than paper over it with a weaker standard.
Most enterprise AI governance is doing the opposite. It’s choosing the standard that’s easy to claim compliance with—a well-written system prompt, a thoughtful CLAUDE.md, a code review checklist—rather than the one that actually addresses the structural problem.
The structural problem is this: you have no external, systematic, model-independent governance layer for your AI coding infrastructure. The corrections your best engineers make never escape the local context. The standards your architects define never reach the execution layer. The drift between intent and reality is invisible until something breaks.
Anthropic spent two and a half years and three policy versions learning that governance has to be external to work. You don’t have to spend that long learning the same lesson.
Conclusion
The RSP v3 is worth reading not because Anthropic got everything right—they’d be the first to tell you they didn’t—but because they’re being honest about what they got wrong and why. That honesty is rare in this industry and valuable.
The lesson for enterprise AI leaders is simple: if the organization that builds the model can’t govern it through the model, you certainly can’t govern your deployment through the model.
Build the external layer. Make it systematic. Make it independent. Make the corrections escape the repo.
The model executes. The system governs. Authority lives outside the model.