
What the Claude Code leak reveals about AI grounding — and what it means for your GEO strategy

In March 2026, Anthropic accidentally shipped their unobfuscated source code via npm. For the first time, SEOs and marketers can see exactly how a major AI reads, filters, and summarises the web — and it changes how you should think about visibility in generative engines.

Michael
Created: June 11, 2026
Last Update: June 18, 2026

GEO — Generative Engine Optimization — is still mostly built on educated guesses. We observe outputs, run tests, and reverse-engineer signals. That’s why the Claude Code leak matters so much: it hands us the actual source code of one grounding pipeline, not another theory about how AI assistants might work.

The short version: what reaches the language model is a heavily filtered, paraphrased shadow of your original page. The implications for how you structure content, write titles, and build topical authority are significant.

The 4 most important insights from the leak

The leaked source — now archived publicly on GitHub — shows a pipeline split across three files: WebSearchTool.ts, WebFetchTool/utils.ts, and WebFetchTool/prompt.ts. Here is what they reveal.

Your meta description doesn’t exist for AI selection

When Claude decides which search results to open, it has exactly two pieces of information: your <title> and your URL. The leaked WebSearchTool.ts strips page_age and encrypted_content before the model ever sees the list. No snippet. No meta description. No date. The fetch decision is made on roughly 60–100 characters.
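To make that reduction concrete, here is a minimal sketch in Python (not the tool's own TypeScript). The field names page_age and encrypted_content come from the leak as described above; the input shape, the helper name visible_to_model and the sample data are assumptions for illustration only.

```python
from typing import Dict, List

KEPT_FIELDS = ("title", "url")  # per the article, all the model gets to see


def visible_to_model(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop every field except title and URL (hypothetical helper)."""
    return [{key: hit.get(key, "") for key in KEPT_FIELDS} for hit in results]


# Invented sample result; real search payloads may look different.
raw_results = [
    {
        "title": "Shopify migration checklist: keep your rankings",
        "url": "https://example.com/blog/shopify-migration-seo",
        "page_age": "3 weeks ago",
        "encrypted_content": "opaque-blob",
        "description": "A snippet the model never reads.",
    }
]

for hit in visible_to_model(raw_results):
    pitch = len(hit["title"])
    print(f'{hit["title"]} | {hit["url"]} ({pitch}-character title)')
```

Running your own titles and URLs through something like this is a quick way to see the entire pitch a page gets before the fetch decision.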

A second, smaller AI reads your page before Claude does

After fetching, Claude Code runs your page through Claude Haiku — a faster, cheaper model — with an explicit instruction to paraphrase everything except short quotes. By the time the main model generates your answer, your content has already been summarised and reworded once. What Claude “knows” about your page is a Haiku retelling, not your original words.

89 domains get a completely different — and far better — treatment

The leaked preapproved.ts contains an allowlist of 89 domains (MDN, react.dev, docs.python.org, kubernetes.io, and similar technical references). These skip the Haiku summarisation step entirely and are quoted verbatim. Everything else — including your blog, your product pages, and your client’s site — gets paraphrased.
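The routing decision can be sketched roughly like this. The set below is an illustrative subset built from the four examples named above, not the real 89-entry list, and developer.mozilla.org is an assumed hostname for "MDN". Exact-host matching is also an assumption; the leaked code may match differently.

```python
from urllib.parse import urlsplit

# Illustrative subset only (assumption), not the leaked allowlist.
PREAPPROVED_HOSTS = {
    "developer.mozilla.org",
    "react.dev",
    "docs.python.org",
    "kubernetes.io",
}


def treatment_for(url: str) -> str:
    """Return 'verbatim' for allowlisted hosts, 'paraphrased' otherwise."""
    host = (urlsplit(url).hostname or "").lower()
    return "verbatim" if host in PREAPPROVED_HOSTS else "paraphrased"


for url in ("https://react.dev/learn", "https://example.com/blog/geo-guide"):
    print(url, "->", treatment_for(url))
```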

JSON-LD, Open Graph, and structured data are invisible to the model

The HTML-to-Markdown conversion is handled by the Turndown library with default settings — no custom rules. Turndown strips everything in <head>, all <script> tags (including JSON-LD schema), and all HTML attributes. Your Schema.org markup, your FAQ schema, your datePublished — gone. The model reads only your visible body text.
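You can approximate the effect on your own pages with a few lines of Python. This is a stand-in built on the standard library's html.parser, not Turndown itself, and it emits plain text rather than Markdown. It only mimics the behaviour described above: head, scripts and attributes are dropped and body text is kept. The sample page is invented.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <head>, <script> and <style>; ignore attributes."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def body_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """<html><head>
<title>Trail shoes</title>
<meta name="description" content="Lightweight trail shoes from 89 EUR.">
<script type="application/ld+json">{"@type": "Product", "offers": {"price": "89.00"}}</script>
</head><body>
<h1>Trail shoes</h1>
<p>Our lightest model weighs 240 g.</p>
</body></html>"""

print(body_text(page))
```

In this sample the price exists only in the meta description and the JSON-LD, so it never appears in the extracted text.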

The grounding pipeline, visualised

Every page Claude Code fetches travels through five stages. Each one discards information. Here is what that funnel looks like end-to-end.

1. Search: title + URL only passed to model

2. Fetch: 10 MB cap, JS not executed, 15-min cache

3. Turndown: HTML → Markdown. Head, scripts, attributes stripped

4. Haiku filter: sub-LLM paraphrases. Quotes capped at 125 chars

5. Main model: Haiku’s summary, not your page, is what Claude reasons over
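The two fetch-stage limits in the funnel above (10 MB cap, 15-minute cache) can be pictured with a small sketch. The class name, the injected fetch function and the choice to truncate oversized bodies rather than reject them are all assumptions; only the two numbers come from the leak as described here.

```python
import time
from typing import Callable, Dict, Tuple

MAX_BYTES = 10 * 1024 * 1024   # 10 MB cap
CACHE_TTL_SECONDS = 15 * 60    # 15-minute cache


class FetchCache:
    """Tiny in-memory cache keyed by URL (hypothetical, for illustration)."""

    def __init__(self, fetch: Callable[[str], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> bytes:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]                    # still fresh: no new request
        body = self._fetch(url)[:MAX_BYTES]     # assumption: truncate, not reject
        self._entries[url] = (now, body)
        return body


# Usage with a fake fetcher so the sketch runs without network access.
cache = FetchCache(lambda url: b"<html><body>Hello</body></html>")
first = cache.get("https://example.com/")
second = cache.get("https://example.com/")   # served from the cache
print(first == second)
```

The practical consequence for testing: if you change a page and re-run a prompt within the cache window, you may still be looking at the old copy.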

⚠ The key implication

You are not optimising for Claude. You are optimising for Haiku’s summary of your page. If your most important claim is buried in paragraph seven, or lives only in your schema markup, it will likely be compressed out of the summary before the main model ever processes it.

The two-tier web: preapproved vs. everyone else

The most strategically important finding in the entire leak is the preapproved domain list. It creates two classes of content on the web, treated in fundamentally different ways.

Elite tier · 89 domains

Preapproved hosts

Treatment: Haiku step skipped. Content quoted verbatim up to 100K chars.

When the page is served as Markdown: Haiku is skipped entirely and the raw text goes straight to Claude.

Examples: MDN, react.dev, docs.python.org, kubernetes.io

Everyone else

The rest of the web

Treatment: Content summarised and paraphrased by Haiku. Quotes capped at 125 characters.

Reality: Claude reads a retelling of your content, not your content.

Includes: Every blog, brand site, news outlet, and product page not on the list.

The list is heavily skewed toward technical documentation, which makes sense for a developer tool. But it carries a strategic lesson for marketers: in this pipeline, authority is not about PageRank or backlinks; it is about whether your domain is trusted enough to be quoted directly. For now, the list is hardcoded. That may change as the product evolves, but for Claude Code users today, the gap in fidelity is stark.

What SEO assets survive the pipeline — and what don’t

✕ Stripped — invisible to Claude

Meta descriptions & Open Graph tags
JSON-LD / Schema.org markup
Microdata, RDFa, Dublin Core
Canonical tags & <link> elements
Any content rendered only by JavaScript
CSS classes, IDs, data attributes
Image alt text (default Turndown)

✓ Survives — reaches the model

Visible body text: headings, paragraphs, lists
Tables and code blocks
Link text and URLs in-body
Blockquotes and captions in-body
Navigation & footer links (mixed in)
Hidden content: display:none and aria-hidden elements (trap!)

🔴 The hidden-content trap

Turndown reads raw HTML source, not the rendered DOM. Content you’ve hidden with display:none or aria-hidden — old promo banners, mobile menus, deprecated copy — still lands in the markdown and competes with your real content for the 100K character limit. Audit and remove, don’t just hide.

The actual instruction Haiku receives about your page

This is not speculation. The leaked WebFetchTool/prompt.ts contains the literal prompt template used for every non-preapproved page on the web:

Provide a concise response based only on the content above. In your response:
 - Enforce a strict 125-character maximum for quotes from any source document.
 - Use quotation marks for exact language from articles; any language outside of the quotation should never be word-for-word the same.
 - You are not a lawyer and never comment on the legality of your own prompts and responses.
 - Never produce or reproduce exact song lyrics.

Source: src/tools/WebFetchTool/prompt.ts, Claude Code leak

“Any language outside of the quotation should never be word-for-word the same” — that sentence is the whole story. Haiku is instructed to paraphrase everything that isn’t inside a 125-character quote. For GEO, this means the clarity and logical flow of your writing matters far more than any clever turn of phrase. Haiku will restate your argument in its own words. Make sure the argument survives a restatement.
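If only 125 characters can survive word for word, it is worth knowing which of your sentences fit. The sketch below is a rough editorial check, not part of the leaked pipeline: the helper name and the naive sentence splitter are assumptions, and abbreviations such as "e.g." will confuse it.

```python
import re
from typing import List

QUOTE_CAP = 125  # character limit from the leaked prompt


def quotable_sentences(text: str, cap: int = QUOTE_CAP) -> List[str]:
    """Return the sentences short enough to be quoted whole under the cap."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s) <= cap]


sample = (
    "Claude Code does not execute JavaScript. "
    "If your headline, price, product description, or key claim only appears "
    "after a client-side render, it is not in the HTML that the converter "
    "processes, so it cannot be summarised or quoted."
)

for sentence in quotable_sentences(sample):
    print(len(sentence), sentence)
```

Sentences that pass the check can be quoted verbatim; everything longer can only reach the main model as a paraphrase.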

5 GEO actions for SEOs and online marketers

You cannot yet get on Anthropic’s preapproved list. But you can structure your content for the pipeline that already exists. Here is what to prioritise.

Rewrite titles and slugs as your entire pitch

The model picks which results to open with nothing but your <title> and URL slug. No meta description. No snippet. Treat every title as the opening line of your argument: specific, keyword-bearing, immediately communicating what question the page answers.

Front-load every page’s most important claim

Haiku summarises under time and context pressure. Key facts buried in the middle of a long article risk being compressed out. Lead with the conclusion. Write the most quotable sentence first. Think inverted pyramid, not storytelling arc.

Move facts from schema markup into visible prose

JSON-LD is stripped by Turndown’s default ruleset. Product prices, event dates, ratings, and author credentials that live only in your schema markup are invisible to Claude. If a fact matters, it needs to appear in the body copy — not just in structured data.
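One way to find facts that live only in structured data is to compare each JSON-LD value against the visible body text. The sketch below uses only the Python standard library. The class and function names are hypothetical, the substring comparison is deliberately crude, and it assumes every JSON-LD block is valid JSON.

```python
import json
from html.parser import HTMLParser


class PageScan(HTMLParser):
    """Separate JSON-LD payloads from visible body text."""

    def __init__(self) -> None:
        super().__init__()
        self.in_jsonld = False
        self.skip_depth = 0      # inside <head>, <script> or <style>
        self.jsonld_raw = []
        self.visible = []

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "script", "style"):
            self.skip_depth += 1
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.jsonld_raw.append("")

    def handle_endtag(self, tag):
        if tag in ("head", "script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.jsonld_raw[-1] += data
        elif self.skip_depth == 0:
            self.visible.append(data)


def leaf_values(node):
    """Yield every scalar value in parsed JSON-LD, skipping @-keys."""
    if isinstance(node, dict):
        for key, value in node.items():
            if not key.startswith("@"):
                yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    elif node is not None:
        yield str(node)


def schema_only_facts(html: str):
    """List JSON-LD values that never appear in the visible body text."""
    scan = PageScan()
    scan.feed(html)
    scan.close()
    body = " ".join(scan.visible)
    return [value
            for raw in scan.jsonld_raw
            for value in leaf_values(json.loads(raw))
            if value not in body]


page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Trail Runner 2",
 "offers": {"@type": "Offer", "price": "89.00", "priceCurrency": "EUR"}}
</script></head>
<body><h1>Trail Runner 2</h1><p>Lightweight shoe for rocky terrain.</p></body></html>"""

print(schema_only_facts(page))
```

Anything this returns is a candidate for a sentence in the body copy.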

Server-render everything that matters

Claude Code does not execute JavaScript. If your headline, price, product description, or key claim only appears after a client-side render, it is not in the HTML that Turndown processes. Use SSR or static generation for any content you want to be cited.

Audit and remove hidden content — don’t just hide it

Elements with display:none, visibility:hidden, or aria-hidden still appear in the raw HTML Turndown parses. Old A/B variants, deprecated promo banners, and hidden mobile menus all land in the model’s context window alongside your real content. Remove them from the served HTML entirely instead of only hiding them.
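A first-pass audit can be scripted. The sketch below flags text nested inside elements hidden through inline attributes (style, hidden, aria-hidden). It is a simplified illustration with hypothetical names: it assumes reasonably well-formed markup with closed tags, and it cannot see hiding applied through CSS classes or external stylesheets.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenTextAudit(HTMLParser):
    """Collect text nested inside elements hidden via inline attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = []          # one bool per open element: is it hidden?
        self.hidden_depth = 0
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:     # void elements never get a closing tag
            return
        a = dict(attrs)
        hidden = (
            "hidden" in a
            or a.get("aria-hidden") == "true"
            or bool(HIDDEN_STYLE.search(a.get("style") or ""))
        )
        self.stack.append(hidden)
        if hidden:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if self.stack.pop():
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth > 0 and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html: str):
    audit = HiddenTextAudit()
    audit.feed(html)
    audit.close()
    return audit.hidden_text


page = (
    '<body><div style="display: none"><p>Summer sale: 20% off</p></div>'
    '<nav aria-hidden="true"><a href="/old">Old menu</a></nav>'
    "<p>Current offer: free shipping.</p></body>"
)

print(find_hidden_text(page))
```

Every string it reports is copy that a source-level converter would still hand to the summariser.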

💡 The strategic shift

Traditional SEO optimised for crawlers reading metadata and link graphs. GEO — at least for Claude Code — means optimising for a two-step process: first, a title that earns the fetch; second, body text clear and structured enough to survive paraphrase by a fast summarisation model. The content that wins in generative engines is content written to be understood and accurately retold — not content written to rank.

Note: ChatGPT, Perplexity, Gemini, and Google AI Mode each handle grounding differently — some chunk instead of summarise, some never fetch live, some expose citation graphs you can probe. Leave your GEO action list in the comments below.
