Crawler Access — Is Your robots.txt Blocking AI Answer Engines?

aeo/crawler-access parses your robots.txt user-agent by user-agent and checks 8 named AI crawlers: GPTBot from OpenAI, ClaudeBot from Anthropic, PerplexityBot, Google-Extended, and four more. It warns once per fully blocked bot and escalates to an error only when every one is disallowed, so blocking them stays a deliberate choice you make, not a verdict the rule hands down.

What it detects

The rule reads your robots.txt and parses it into a map of user-agent to its Disallow patterns, lowercasing every agent name so the lookup is case-insensitive and stacking consecutive User-agent lines that share one rule block. It then walks a default list of 8 AI crawler user-agents: GPTBot (OpenAI), ChatGPT-User (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Bytespider (ByteDance), Google-Extended (Google), CCBot (Common Crawl), and Applebot-Extended (Apple). You can override this list in pseolint.config.ts to add or remove agents.
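The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the rule's actual source: the names `parseRobots` and `RobotsMap` are hypothetical, and the handling of comments and empty `Disallow:` lines is an assumption about reasonable behaviour.

```typescript
// Hypothetical sketch: map each lowercased user-agent to its Disallow patterns.
type RobotsMap = Map<string, string[]>;

function parseRobots(text: string): RobotsMap {
  const rules: RobotsMap = new Map();
  let currentAgents: string[] = []; // agents sharing the block being read
  let inAgentRun = false; // true while reading consecutive User-agent lines

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // assumption: strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue; // blank or malformed line

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      // A User-agent line that follows a rule line starts a new block;
      // consecutive User-agent lines stack into the same block.
      if (!inAgentRun) currentAgents = [];
      inAgentRun = true;
      const agent = value.toLowerCase(); // case-insensitive lookup
      currentAgents.push(agent);
      if (!rules.has(agent)) rules.set(agent, []);
    } else {
      inAgentRun = false;
      // Assumption: an empty "Disallow:" allows everything, so it is skipped.
      if (field === "disallow" && value !== "") {
        for (const agent of currentAgents) rules.get(agent)!.push(value);
      }
    }
  }
  return rules;
}

const parsed = parseRobots(
  "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/"
);
// parsed.get("gptbot")    -> ["/"]
// parsed.get("claudebot") -> ["/"]
// parsed.get("*")         -> ["/admin/"]
```

Keying the map by lowercased agent name means a later lookup for `GPTBot`, `gptbot`, or `GPTBOT` all resolve to the same block.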

For each crawler the rule asks one question: is this bot fully disallowed? A bot counts as blocked when its own block contains a root Disallow (`Disallow: /` or `Disallow: /*`), or when it has no rule of its own and falls back to a wildcard `User-agent: *` block that is itself fully disallowed. A bot with its own narrower block — say `Disallow: /admin/` — is not counted as blocked, because the rest of the site is still readable.
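As a rough sketch of that check, assuming the rules have already been parsed into a map of lowercased user-agent to Disallow patterns (the function names here are hypothetical, not the rule's real API):

```typescript
// A pattern counts as a root disallow when it covers the entire site.
const isRootDisallow = (pattern: string): boolean =>
  pattern === "/" || pattern === "/*";

// Hypothetical helper: is this agent fully disallowed?
function isFullyBlocked(rules: Map<string, string[]>, agent: string): boolean {
  const own = rules.get(agent.toLowerCase());
  // A bot with its own block is judged on that block alone;
  // only a bot with no block of its own falls back to the wildcard.
  const effective = own ?? rules.get("*") ?? [];
  return effective.some(isRootDisallow);
}

const example = new Map<string, string[]>([
  ["*", ["/"]], // wildcard blocks the whole site
  ["gptbot", ["/admin/"]], // narrower own block escapes the wildcard
  ["claudebot", ["/*"]], // own root disallow
]);

isFullyBlocked(example, "GPTBot"); // false: own block is narrow
isFullyBlocked(example, "ClaudeBot"); // true: own root disallow
isFullyBlocked(example, "PerplexityBot"); // true: inherits the wildcard
```

The `own ?? wildcard` ordering is what lets a bot with its own narrow block escape a fully disallowed wildcard.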

Every fully blocked crawler produces one warning naming that bot. If the count of blocked crawlers equals the full configured list — every AI agent disallowed — the warnings collapse into a single error instead, because total blocking is an unambiguous, site-wide decision worth one clear finding rather than 8 scattered ones.
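The collapse from per-bot warnings into one error might look something like the sketch below. The `Finding` shape, the `toFindings` name, and the message strings are all illustrative assumptions; only the warning-versus-error logic comes from the description above.

```typescript
// Hypothetical finding shape, for illustration only.
type Finding = { severity: "warning" | "error"; message: string };

function toFindings(configured: string[], blocked: string[]): Finding[] {
  if (blocked.length === 0) return []; // nothing blocked: stay silent

  // Every configured crawler is blocked: one site-wide error.
  if (blocked.length === configured.length) {
    return [
      {
        severity: "error",
        message: `All ${configured.length} configured AI crawlers are fully disallowed`,
      },
    ];
  }

  // Otherwise: one warning per fully blocked crawler.
  return blocked.map(
    (bot): Finding => ({
      severity: "warning",
      message: `${bot} is fully disallowed`,
    })
  );
}

const defaultAgents = [
  "GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
  "Bytespider", "Google-Extended", "CCBot", "Applebot-Extended",
];

toFindings(defaultAgents, ["GPTBot", "ClaudeBot", "PerplexityBot"]).length; // 3 warnings
toFindings(defaultAgents, defaultAgents).length; // 1 error
```

Comparing against the configured list rather than a fixed 8 keeps the threshold correct when the list is overridden.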

Why it matters

Answer engines like ChatGPT, Claude, Perplexity, and Google's AI Overviews build their responses from pages their crawlers are allowed to fetch. If GPTBot, ClaudeBot, or PerplexityBot hit a `Disallow: /` in your robots.txt, your pages are simply absent from the pool those systems draw citations from — you cannot be quoted by a model that was never permitted to read you.

This is a tradeoff, not a mistake. Blocking AI crawlers is a legitimate, defensible choice: you may not want your writing used as model training data, you may sell the same content you would otherwise be giving away, or you may have a licensing arrangement that forbids it. The rule does not tell you that you must let these bots in. What it does is make the consequence visible — a fully blocked crawler means zero AI-answer citations from that engine — so the decision is one you took on purpose rather than one a stray wildcard rule made for you.

The severity split mirrors that intent. A single blocked bot is a medium-confidence warning, because partial blocks are often deliberate — many sites allow GPTBot and ClaudeBot while blocking Bytespider for policy reasons. Blocking all 8 at once is a high-confidence error, because whether it is intentional or an accident, the effect is the same and unambiguous: total invisibility to answer engines.

A page that fails

Brasswind Press, an independent tabletop-RPG publisher, ships this robots.txt across its store and SRD pages:

```
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

The wildcard block only hides /admin/, so most bots are fine, but GPTBot, ClaudeBot, and PerplexityBot each carry a root `Disallow: /`. The rule emits 3 warnings, one per bot. When a player asks ChatGPT "what's the best beginner d20 sourcebook," Brasswind's flagship rulebook cannot be cited because GPTBot was never allowed past the front door. Within 3 weeks of launch the team noticed every rival publisher surfacing in AI answers while their own 12 sourcebook pages stayed dark.

A page that passes

Brasswind Press narrows the blocks so AI crawlers can read the free content while the unreleased campaign setting stays private:

```
User-agent: *
Disallow: /admin/
Disallow: /unreleased-campaign/

User-agent: GPTBot
Disallow: /unreleased-campaign/

User-agent: Bytespider
Disallow: /
```

GPTBot now has its own block, but it is narrow: only the secret setting is hidden, so GPTBot is not counted as fully blocked. ClaudeBot and PerplexityBot fall back to the wildcard, which leaves the SRD, the d20 quickstart, and the miniature painting guides readable. Only Bytespider is fully disallowed, a deliberate single choice. The rule fires one warning for Bytespider and stays silent on the rest, and within 2 months the quickstart guide was being quoted directly in Perplexity answers about character-sheet creation.

How to fix it

  1. Open robots.txt and find every block with a root `Disallow: /`. For each named AI crawler you want quotable, delete that root rule so the bot can reach your public pages again.
  2. If you only meant to hide private areas, replace `Disallow: /` with the specific paths, for example `Disallow: /drafts/` and `Disallow: /admin/`, so the rest of the site stays crawlable by answer engines.
  3. Decide deliberately which bots you keep out. Blocking a scraper like Bytespider while allowing GPTBot and ClaudeBot is a valid stance; just confirm it is the stance you actually want.
  4. Remember the wildcard fallback: a `User-agent: *` block with `Disallow: /` silently blocks every AI crawler that has no rule of its own. Give bots you want to allow their own narrower block to escape it.
  5. After editing, re-run the audit. The rule downgrades from a site-wide error to per-bot warnings to silence as you reopen access, so you can watch each decision take effect.

SpamBrain context

Crawler access sits slightly apart from Google's SpamBrain quality signals: blocking an AI crawler is not spam and incurs no penalty. It is a publishing-rights decision, and the only thing at stake is reach into answer engines, not your standing in classic search.

That distinction is why this rule is built to be balanced rather than scolding. A SpamBrain-class rule says "this looks like manipulation"; this rule says "this is the visibility consequence of a choice you are entitled to make." GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and Google-Extended (Google) each respect robots.txt by their operators' own published policies, which is exactly what gives a Disallow rule real force — and what makes an accidental one genuinely costly. A site that meant to block a single training bot but pasted a wildcard `Disallow: /` can erase itself from every answer engine without ever touching its Google rankings.

The rule's job is to catch that gap between intent and effect. It names the real operators so you can weigh each one — a publisher might happily let Anthropic and OpenAI quote a free quickstart while refusing Common Crawl's CCBot — and it reserves its single error for the all-or-nothing case where the stakes are highest and the intent least likely to be deliberate.

Frequently asked questions

Why would an independent RPG publisher ever want to block AI crawlers?
Plenty of good reasons. If Brasswind Press sells a hardcover rulebook that took 18 months to write, handing the full text to a model that will paraphrase it for free undercuts the sale. A publisher may also have a licensing deal with an illustrator or co-author whose work cannot be used as training data, or may simply object on principle to their campaign settings feeding model training. The rule respects all of that: it warns so the choice is conscious; it never says you are wrong to make it.
What is the difference between a warning and an error here?
Each fully blocked AI crawler emits one warning at medium confidence, because a partial block is usually deliberate — allowing GPTBot but blocking Bytespider, for instance. The single error only appears when every configured crawler in the list is disallowed at once. At that point the finding collapses from many warnings into one high-confidence error, since total invisibility to answer engines is a single site-wide decision, whether you made it on purpose or by accident.
Does blocking GPTBot also block ChatGPT browsing or hurt my Google ranking?
GPTBot and ChatGPT-User are separate user-agents — GPTBot is OpenAI's training and indexing crawler, ChatGPT-User fetches a page a user explicitly asked about. The rule checks both. And no, blocking AI crawlers does not touch classic Google rankings: Googlebot and Google-Extended are distinct agents, so you can block AI training while staying fully indexed for normal search.
How does a wildcard block affect a bot that has no rule of its own?
If a crawler has no `User-agent:` block naming it, it falls back to the `User-agent: *` block. So a wildcard `Disallow: /` counts as blocking every AI crawler that lacks its own entry. This is the most common accidental block — give any bot you want to allow its own narrower block, and it escapes the wildcard rather than inheriting the root disallow.
Will a narrow Disallow like /admin/ trigger the rule?
No. The rule only counts a crawler as blocked when its effective rule contains a root `Disallow: /` or `Disallow: /*`. A bot with a narrower block such as `Disallow: /unreleased-campaign/` is treated as allowed, because the rest of your site is still readable. You can hide drafts and private sections from AI crawlers without ever tripping a finding.

Want to know whether this rule actually fires on your site?

Run pseolint against your sitemap. The audit is free, takes about a minute, and returns a per-URL list of every rule that fired — including this one — with the exact metric values so you can prioritise the fix queue.