Last week Anthropic made the 1-million-token context window generally available across Claude Opus 4.6 and Sonnet 4.6. No long-context premium. A 900K-token request costs the same per-token rate as a 9K one. The entire Harry Potter series in a single prompt.
The developer community celebrated. Bigger context means more code, more documents, more conversation history. No more chunking. No more summarization workarounds. No more losing context halfway through a session.
Here’s the problem nobody is talking about: a bigger context window doesn’t mean the model pays more attention to your instructions. It means your instructions have more competition.
That claude.md file you spent an afternoon perfecting? The one with your team’s coding standards, your API conventions, your “always use TypeScript strict mode” rule? It’s about to get a lot less effective.
The Five Phases of claude.md Grief
If you’re using Claude Code, Cursor, Windsurf, or any AI coding assistant with a project instruction file, you’re somewhere on this curve:
Phase 1 — Awe. You tell the model to use TypeScript strict mode and it does. You correct it once about your internal API client and it remembers for the whole session. You think: this is magic. I just need to write down my rules and the AI follows them.
Phase 2 — Confidence. You write a comprehensive claude.md with 40 rules. Formatting standards. Import conventions. Error handling patterns. Naming conventions. You’ve basically written an onboarding doc for a junior developer, and the AI reads it every time. You feel like you’ve solved AI governance.
Phase 3 — Frustration. Halfway through a long session, the model uses fetch instead of your internal API client. Again. The rule is right there in the file. You scroll up — yes, it’s there, line 23, clear as day: “Always use apiClient from @internal/http, never raw fetch.” But the model ignored it. You correct it. It apologizes. Ten minutes later, it does it again.
Phase 4 — Workarounds. You’ve learned the tricks. Put the most important rules at the top. Repeat critical instructions in your prompts. Break sessions more frequently so context doesn’t accumulate. Keep the claude.md short — maybe the model pays more attention to 10 rules than 40. You’ve accepted that you need to babysit it. You’ve developed coping strategies for a tool that’s supposed to be autonomous.
Phase 5 — Architectural understanding. You realize the model isn’t ignoring your rules out of negligence. It’s a statistical system. Your rules aren’t rules — they’re tokens competing for attention weight against every other token in the context window. And they’re losing.
Most users never get past Phase 4. They develop workarounds and accept the ceiling. But Phase 5 is where it gets interesting — because once you understand why the problem exists, you realize no amount of better markdown will fix it.
Context Is Not Memory. It’s a Competition.
Here’s what’s actually happening inside the transformer when you send a message with your claude.md loaded.
The model’s attention mechanism compares every token in the context window against every other token and computes relevance weights. This is how the model decides what to “pay attention to” when generating its next response. The key insight: attention is a fixed budget. For each token the model generates, the softmax normalizes those weights so they sum to 1.0 — a finite amount of attention distributed across everything in the window.
When your context window is 8K tokens — your claude.md (800 tokens), a short conversation (2K tokens), and the current task (5K tokens) — your rules represent about 10% of the attention budget. They get meaningful weight. The model follows them reliably.
When your context window is 200K tokens — your claude.md (800 tokens), a long conversation with tool calls and code outputs (180K tokens), and the current task (19K tokens) — your rules represent 0.4% of the attention budget. They’re a whisper in a stadium.
When your context window is 1M tokens, your 800 tokens of rules represent 0.08% of the available attention.
This is the attention dilution problem. It’s not a bug. It’s a fundamental property of how transformer attention works. More context means more tokens competing for the same attention budget. Your governance rules don’t get louder as the context grows — they get quieter.
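The dilution arithmetic above takes three lines to verify. This sketch uses the article's simplifying assumption that a rule block's share of attention scales with its share of tokens — a rough proxy for illustration, not a measurement of real attention weights:

```python
RULES = 800  # tokens in a typical claude.md, per the scenarios above

for window in (8_000, 200_000, 1_000_000):
    share = RULES / window
    print(f"{window} token context: rules are {share:.2%} of the tokens")

# 8000 token context: rules are 10.00% of the tokens
# 200000 token context: rules are 0.40% of the tokens
# 1000000 token context: rules are 0.08% of the tokens
```

The rules never change. Only the denominator does — and the denominator is what the 1M window grows.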
Anthropic’s own documentation acknowledges this. Their context window docs now state: “As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what’s in context just as important as how much space is available.”
Read that last sentence again. Curating what’s in context is just as important as how much space is available.
That’s not a prompting tip. That’s an architectural statement. And it’s one the industry hasn’t internalized yet.
The 1M Token Paradox
So Anthropic gives everyone a million tokens and removes the pricing premium. The instinct is to use it — load the entire codebase, keep the full conversation history, never throw anything away. Context is cheap now. Why not use it all?
Because more context doesn’t mean better governance. It means worse governance.
Your claude.md was already struggling at 200K tokens. At 1M tokens, it’s fighting for attention against 1,250 times more content than it contains. The model’s attention mechanism is distributing weight across a million tokens, and your 40 carefully written rules are statistically indistinguishable from noise.
This is the paradox: the feature that makes the model more capable at reasoning over large inputs simultaneously makes it less reliable at following instructions. The capability improvement and the governance degradation come from the same source — a bigger context window.
The teams that will succeed with 1M token contexts are not the ones who dump everything in and hope. They’re the ones who treat context management as an engineering discipline — deciding what goes in, what stays out, and what gets priority when attention is scarce.
The Problem Isn’t the Model. It’s the Architecture.
Let’s be precise about what’s failing. The model isn’t defective. Claude scores 78.3% on MRCR v2 at the full 1M tokens — the highest among frontier models. The model is doing its job. The problem is that we’re asking it to do two different jobs with the same mechanism.
Job 1: Reason over a large body of information. This is what the 1M context window is for. Analyze a codebase. Compare documents. Trace a long conversation. This job benefits from more context.
Job 2: Follow governance rules reliably. This is what claude.md is for. Use the right API client. Follow the team’s formatting standards. Never commit secrets. This job suffers from more context — because the rules get drowned out.
The reason both jobs fail in the same window is that we’re treating governance rules as content. We’re putting them in the same undifferentiated token stream as everything else and hoping the statistical attention mechanism decides they’re important. Sometimes it does. Sometimes it doesn’t. We have no control over which.
This is the architectural insight: governance of AI systems is not a prompting problem. It’s a memory management problem.
What the Database World Already Learned
The AI industry is rediscovering a problem that database engineers solved in the 1970s.
Early databases loaded everything into memory — all the data, all the indexes, all the schema definitions, all in one flat memory space. It worked for small datasets. It fell apart at scale, for exactly the same reason claude.md falls apart at scale: when everything competes for the same resource, the important stuff gets drowned out by the voluminous stuff.
The solution was the buffer manager — a layer that sat between storage and the query executor, managing what was in memory at any given moment. Pages were classified by type. Index pages got different treatment than data pages. Hot pages stayed in memory. Cold pages were evicted. Critical system pages were pinned — they could never be evicted, regardless of memory pressure.
The query executor didn’t decide what was in memory. The buffer manager decided. The query executor operated on whatever working set the buffer manager gave it. This separation of concerns — the executor reasons, the buffer manager governs the working set — is the fundamental insight that made databases scale.
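The separation of concerns described above fits in a few lines. This is a toy buffer pool with pinned pages and least-recently-used eviction — an illustration of the classic design, not any particular database's implementation:

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer pool: pinned pages are never evicted;
    unpinned pages are evicted least-recently-used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> (data, pinned)

    def load(self, page_id, data, pinned=False):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # already resident: mark hot
            return
        while len(self.pages) >= self.capacity:
            self._evict_one()
        self.pages[page_id] = (data, pinned)

    def _evict_one(self):
        # Evict the least-recently-used page that is NOT pinned.
        for page_id, (_, pinned) in self.pages.items():
            if not pinned:
                del self.pages[page_id]
                return
        raise MemoryError("all pages pinned; cannot evict")

    def get(self, page_id):
        data, _ = self.pages[page_id]
        self.pages.move_to_end(page_id)  # mark as recently used
        return data
```

The query executor only ever calls `get()`. It never decides what survives memory pressure — `load()` and `_evict_one()` do, and the pinned flag guarantees the critical system pages are still there no matter what the workload looks like.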
AI systems need the same architecture. The model is the query executor. The context window is the buffer. And right now, there is no buffer manager. We’re loading everything into memory and hoping the query executor figures out what’s important. That stopped working for databases in 1975. It’s not going to work for AI in 2026.
What the Solution Looks Like
The answer isn’t writing better claude.md files. The answer is replacing the text file with an architecture.
Classified memory pages. Not all context is equal. Governance rules, domain knowledge, conversation history, tool outputs, and speculative reasoning artifacts should be different classes of content with different management policies. Governance rules should be pinned — architecturally guaranteed to be present and impossible to evict, no matter how much other content accumulates.
Managed promotion and eviction. When context budget is limited (and it always is, even at 1M tokens), the system should intelligently decide what enters the context and what gets evicted. Tool outputs from three steps ago that are no longer relevant should be evicted. The rule about using the internal API client should stay.
Consequence-aware prioritization. The stakes of the current task should determine what’s in context. When the model is brainstorming (low consequence), governance rules can be lighter. When the model is about to deploy code to production (high consequence), governance rules should dominate the working set. The same instruction file doesn’t serve both modes.
Context composition auditing. For every decision the model makes, you should be able to answer: what was in its context window? Which rules were present? Were the relevant governance standards loaded? If the model made a mistake, was it because the rule was absent from context, or because the rule was present but diluted?
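A minimal sketch of what such a layer could look like, combining all four ideas: classified pages, pinned governance, a consequence-aware budget, and an audit trail. Every name here is hypothetical — this is a design illustration under the assumptions above, not any existing tool's API:

```python
from dataclasses import dataclass

@dataclass
class Page:
    kind: str        # "governance" | "knowledge" | "history" | "tool_output"
    tokens: int
    text: str
    pinned: bool = False
    last_used: int = 0  # step at which this page was last relevant

class ContextManager:
    """Composes the working set for each model call: governance pages are
    pinned and always present; everything else competes for the budget."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.pages: list[Page] = []
        self.audit_log: list[dict] = []

    def compose(self, step: int, consequence: str = "low") -> list[Page]:
        working_set, used = [], 0
        # 1. Pinned governance pages enter first, unconditionally.
        for p in self.pages:
            if p.pinned:
                working_set.append(p)
                used += p.tokens
        # 2. High-consequence tasks cap unpinned content, so the rules
        #    occupy a larger fraction of what the model attends to.
        cap = self.budget if consequence == "low" else int(self.budget * 0.7)
        # 3. Fill the remaining budget with the most recently relevant pages;
        #    stale tool outputs fall off the end.
        rest = sorted((p for p in self.pages if not p.pinned),
                      key=lambda p: p.last_used, reverse=True)
        for p in rest:
            if used + p.tokens <= cap:
                working_set.append(p)
                used += p.tokens
        # 4. Audit: record exactly what was in context for this decision.
        self.audit_log.append({
            "step": step,
            "consequence": consequence,
            "tokens": used,
            "governance_present": any(p.kind == "governance"
                                      for p in working_set),
        })
        return working_set
```

The point of the audit log is the question from the paragraph above: for any decision, was the rule absent from context, or present but diluted? With an undifferentiated token stream you cannot answer that. With a managed working set, the answer is one lookup.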
This isn’t hypothetical. It’s the same architecture that makes every production database, every operating system, and every managed runtime reliable at scale. The AI industry just hasn’t built it yet.
The Race to Govern the Context Window
The competitive landscape is moving fast. Claude Code has claude.md. Cursor has .cursorrules. Windsurf has project instructions. GitHub Copilot has custom instructions. Every AI coding tool has converged on the same pattern: a text file that tries to tell the model how to behave.
They’ve all converged on the same pattern because the demand is real — teams need their AI tools to follow organizational standards. But they’ve also all converged on the same limitation — a text file in a context window is governance-as-content, and governance-as-content doesn’t scale.
The tool that replaces the text file with an architecture will own this layer. Not a better text file. Not a longer text file. Not a text file with clever formatting. An architecture that manages what the model reasons about, with the same rigor that a database engine manages what the query executor operates on.
The 1M token context window makes this more urgent, not less. Every team that adopts it will discover — within weeks — that their instruction file stopped working halfway through their first long session. They’ll enter Phase 3 of the adoption curve. And they’ll be looking for a solution that isn’t “write shorter rules and restart more often.”
The window is open. The problem is universal. And the solution is architectural.