Which AI Bots Should You Block? (And Why robots.txt Won't Stop Them)

A plain-English guide to AI crawler access: the training vs. retrieval vs. user-fetch bot taxonomy, which to allow or block, and why a firewall enforces where robots.txt only asks.

By Andrej Ruckij · June 16, 2026 · 6 min read

TL;DR: AI bots aren’t one thing. Training bots turn your content into model weights and give nothing back; retrieval/search bots turn it into cited answers with a link to you; user-fetch bots fire when a real person asks an AI to open your page — that’s a visitor, never block it. The sensible default: block training, allow the other two. But the trap is enforcement: robots.txt only asks well-behaved bots to stay away — a firewall (WAF) is what actually blocks a request. If you genuinely need a bot out, the txt file won’t do it.

“Should I block AI from my site?” is the wrong question. The right one is “block which AI, doing what?” — because the bot crawling you to train a model is a completely different animal from the bot fetching your page because a customer just asked ChatGPT about you. Get the categories right and the policy is easy. Get the enforcement right and you avoid the most common mistake: thinking robots.txt is a lock when it’s a polite sign.

Three kinds of bot, three different deals

| Category | What it does | What you get back | Default move |
|---|---|---|---|
| Training | Fetches your content to build or refine model weights | Nothing: no link, no traffic | Block (unless you want to train models for free) |
| Retrieval / search | Fetches to build cited AI-search answers | Citations + referral clicks | Allow |
| User-fetch | Fires when a real person asks the AI to open your page | A live visitor with intent | Never block |

That third category trips people up. When someone pastes your URL into ChatGPT and asks “is this legit?”, a ChatGPT-User request hits your server. Block it and you’ve blocked a customer mid-decision. These aren’t crawlers harvesting you — they’re your traffic.

The clean rule: Disallow training, Allow retrieval + user-fetch. It only works cleanly where a vendor exposes separate tokens for training vs. search (OpenAI and Anthropic do — you can block GPTBot while keeping OAI-SearchBot). Where a vendor blurs the two, you’re trading some citation visibility for training refusal — a real tradeoff, not a free lunch.
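As a rough sketch of that rule, the Python below builds a robots.txt with one group per category and checks the result with the standard library's `urllib.robotparser`. The token lists are a partial, illustrative subset, and `POLICY` and `build_robots_txt` are hypothetical names chosen for this example, not anything a vendor ships.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical per-category policy; the token lists are a partial, illustrative subset.
POLICY = {
    "training": {"tokens": ["GPTBot", "ClaudeBot", "CCBot"], "rule": "Disallow: /"},
    "retrieval": {"tokens": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"], "rule": "Allow: /"},
    "user-fetch": {"tokens": ["ChatGPT-User", "Claude-User"], "rule": "Allow: /"},
}


def build_robots_txt(policy):
    """Render one robots.txt group per category: its User-agent lines, then its rule."""
    groups = []
    for category, entry in policy.items():
        lines = [f"# {category}"]
        lines += [f"User-agent: {token}" for token in entry["tokens"]]
        lines.append(entry["rule"])
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(POLICY)

# Parse the generated file the way a compliant crawler might, then spot-check it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for token in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    print(token, "allowed" if parser.can_fetch(token, "/pricing") else "disallowed")
```

Keeping the policy keyed by category rather than by bot means a new token only has to be added to one list, and the stance for that category stays in one place.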

The part nobody tells you: robots.txt doesn’t enforce anything

This is the load-bearing point. robots.txt is a voluntary honor system. The spec (RFC 9309) leaves compliance up to the crawler and says plainly that its rules “are not a form of access authorization.” It’s a Code of Conduct sign at the pool: a polite bot reads it and behaves; nothing forces it to.

A WAF (web application firewall) enforces. A blocked request gets a 403 at the edge, before it ever reaches your site — and the firewall never even consults robots.txt. They live at different layers: robots.txt requests, the firewall acts. If both exist, the firewall wins by construction.
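To make the layering concrete, here is a minimal Python sketch of an enforcing layer, written as WSGI middleware that stands in for a real WAF (which would normally sit at the edge, in front of the server). `block_bots`, `BLOCKED_TOKENS`, and the sample user-agent strings are hypothetical, and the match is on the user-agent header alone, which is only good enough for illustration. Nothing in it reads robots.txt.

```python
# Illustrative token list; matching is a case-insensitive substring test on the User-Agent header.
BLOCKED_TOKENS = ("GPTBot", "CCBot")


def block_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from blocked user agents get a 403 before the app runs."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(app, user_agent):
    """Call a WSGI app with a fake request and return the status line it sends."""
    seen = []
    body = app({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    list(body)  # drain the response body
    return seen[0]


protected = block_bots(site)
print(status_for(protected, "Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for(protected, "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

The blocked request never reaches `site` at all, which is the difference between enforcing and asking.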

Which means: robots.txt only governs the bots that choose to obey it. The ones you’d most want to stop — scrapers that spoof their user-agent, rotate IP addresses, and publish no verifiable ranges — ignore it entirely. For those, your only real tool is a firewall rule plus IP blocking. And because a user-agent string is trivially faked, pair any rule you actually care about with the vendor’s published IP ranges (OpenAI, Anthropic, Perplexity all publish them now) rather than UA-matching alone.
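One way to pair the two checks, sketched in Python with the standard `ipaddress` module. The CIDR blocks here are placeholders taken from the reserved documentation ranges, not any vendor's real ranges, and `PUBLISHED_RANGES` and `verify_claim` are hypothetical names; in practice the ranges would be loaded from each vendor's published list.

```python
import ipaddress

# Placeholder CIDRs from the reserved documentation ranges, NOT real vendor ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def verify_claim(user_agent, remote_ip, ranges=PUBLISHED_RANGES):
    """Return 'verified', 'spoofed', or 'unclaimed' for a request.

    'unclaimed' means the user-agent names none of the known tokens.
    """
    ua = user_agent.lower()
    claimed = next((token for token in ranges if token.lower() in ua), None)
    if claimed is None:
        return "unclaimed"
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return "spoofed"  # unparseable address: treat the claim as unverifiable
    networks = (ipaddress.ip_network(cidr) for cidr in ranges[claimed])
    return "verified" if any(ip in net for net in networks) else "spoofed"


print(verify_claim("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.10"))
print(verify_claim("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7"))
print(verify_claim("Mozilla/5.0 (Macintosh) Chrome/126.0", "203.0.113.7"))
```

A request that claims a token but arrives from outside the published ranges is the interesting case: the user-agent says one thing and the network says another, so a firewall rule can treat it as hostile rather than trusting the label.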

The cautionary tale

In August 2025, Cloudflare reported catching Perplexity crawling sites that had explicitly blocked it — rotating user-agents and networks, using a generic “Chrome on macOS” identity to slip past. Cloudflare de-listed it as a verified bot; Perplexity denied it. Whatever the merits, the lesson is clean: a polite directive only works on the polite. Enforcement is a firewall problem.

A current cheat-sheet (mid-2026 — expect drift)

Exact spelling matters for robots.txt rules. Re-check vendor docs before deploying.

  • Training (usually block): GPTBot (OpenAI), ClaudeBot (Anthropic, which now publishes IP ranges), CCBot (Common Crawl), Amazonbot, Meta-ExternalAgent. Plus two opt-out tokens that aren’t crawlers: Google-Extended and Applebot-Extended make no requests; they only tell Google/Apple not to use what they already fetched for training.
  • Retrieval / search (usually allow): OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot.
  • User-fetch (never block): ChatGPT-User, Claude-User, Perplexity-User (note: Perplexity-User ignores robots.txt by design, because it’s user-triggered).
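Before settling on a policy, it can help to see which categories actually show up in your traffic. The sketch below, in Python, tallies user-agent strings by category using the tokens from the cheat-sheet above. `CATEGORIES`, `categorize`, and `tally` are hypothetical names, the sample strings are simplified stand-ins for real log lines, and Google-Extended and Applebot-Extended are left out because they never appear in a request.

```python
from collections import Counter

# Tokens copied from the cheat-sheet; re-check vendor docs before relying on them.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Amazonbot", "Meta-ExternalAgent"],
    "retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user-fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def categorize(user_agent):
    """Map a user-agent string to a category by case-insensitive token match."""
    ua = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "other"


def tally(user_agents):
    """Count requests per category."""
    return Counter(categorize(ua) for ua in user_agents)


# Simplified, made-up sample of user-agent strings from an access log.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Macintosh) Chrome/126.0",
]
for category, count in tally(sample).items():
    print(category, count)
```

This counts what a request claims to be, not what it is; a spoofed user-agent lands in whichever bucket it names.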

The bigger shift: access control is moving to the edge

The whole question is migrating off your robots.txt and onto your infrastructure. In July 2025, Cloudflare began blocking AI crawlers by default for its customers and launched pay-per-crawl — using the old HTTP 402 Payment Required code to let sites charge bots for access. With Cloudflare fronting a large share of the web, the real control point increasingly sits at the CDN, not in a text file in your repo. And llms.txt, the proposed “map your site for LLMs” standard, is advisory too — useful for helping cooperative AI understand you, useless as a lock.

What to do now

  1. Decide your stance per category, not per bot: block training, allow retrieval + user-fetch.
  2. Put the policy in robots.txt for the well-behaved majority — it’s still worth doing.
  3. Enforce the cases you actually care about at the firewall/CDN, verified by published IP ranges, not user-agent strings.
  4. Never block user-fetch agents — you’d be blocking customers.
  5. Check what your CDN already does by default (Cloudflare may be blocking or monetizing AI access for you).

Honest caveat

User-agent tokens, IP-range publication, and platform defaults change frequently — this snapshot is June 2026. What won’t change is the principle: robots.txt is a request, a firewall is enforcement, and the smartest policy is per-category, not all-or-nothing.

Key takeaways

  • Three bot types, three deals: training (block — nothing back), retrieval/search (allow — citations + clicks), user-fetch (never block — it’s a visitor).
  • robots.txt only asks; a WAF enforces. robots.txt governs polite bots only; spoofing scrapers ignore it.
  • UA strings are spoofable — enforce with vendor-published IP ranges for anything that matters.
  • Google-Extended / Applebot-Extended are opt-out signals, not crawlers.
  • Access control is moving to the edge: Cloudflare default-blocks AI bots and offers HTTP-402 pay-per-crawl; llms.txt is advisory, not a lock.
