Which AI Bots Should You Block? (And Why robots.txt Won’t Stop Them)

By Andrej Ruckij · June 16, 2026

TL;DR: AI bots aren’t one thing. Training bots turn your content into model weights and give nothing back. Retrieval/search bots turn it into cited answers with a link to you. User-fetch bots fire when a real person asks an AI to open your page: that’s a visitor, so never block it. The sensible default is to block training and allow the other two. The trap is enforcement: robots.txt only asks well-behaved bots to stay away, while a firewall (WAF) is what actually blocks a request. If you genuinely need a bot out, robots.txt alone won’t do it.

“Should I block AI from my site?” is the wrong question. The right one is “block which AI, doing what?” — because the bot crawling you to train a model is a completely different animal from the bot fetching your page because a customer just asked ChatGPT about you. Get the categories right and the policy is easy. Get the enforcement right and you avoid the most common mistake: thinking robots.txt is a lock when it’s a polite sign.

Three kinds of bot, three different deals

Category	What it does	What you get back	Default move
Training	Fetches your content to build/refine model weights	Nothing — no link, no traffic	Block (unless you want to train models for free)
Retrieval / search	Fetches to build cited AI-search answers	Citations + referral clicks	Allow
User-fetch	Fires when a real person asks the AI to open your page	A live visitor with intent	Never block

That third category trips people up. When someone pastes your URL into ChatGPT and asks “is this legit?”, a ChatGPT-User request hits your server. Block it and you’ve blocked a customer mid-decision. These aren’t crawlers harvesting you — they’re your traffic.

The clean rule: Disallow training, Allow retrieval + user-fetch. It only works cleanly where a vendor exposes separate tokens for training vs. search (OpenAI and Anthropic do — you can block GPTBot while keeping OAI-SearchBot). Where a vendor blurs the two, you’re trading some citation visibility for training refusal — a real tradeoff, not a free lunch.
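As a rough illustration of that rule, here is a minimal Python sketch that writes the policy as a robots.txt and checks it with the standard library’s urllib.robotparser. It uses only the OpenAI tokens named above (GPTBot, OAI-SearchBot, ChatGPT-User). The helper name allowed and the example.com URL are made up for illustration. The check shows what a compliant bot would conclude from the file, not what any given bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Per-category policy, limited to the OpenAI tokens named in this article.
# Re-check exact token spelling against vendor docs before deploying.
ROBOTS_TXT = """\
# Training: refuse
User-agent: GPTBot
Disallow: /

# Retrieval/search and user-fetch: allow
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a robots.txt-compliant client would decide for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for token in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        print(token, allowed(ROBOTS_TXT, token, "https://example.com/pricing"))
```

Grouping the two allowed tokens under one Allow line keeps the file short. Any bot without a matching group falls through to “allowed”, which is the protocol’s default.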

The part nobody tells you: robots.txt doesn’t enforce anything

This is the load-bearing point. robots.txt is a voluntary honor system. The spec (RFC 9309) says outright that its rules “are not a form of access authorization”; honoring them is up to the crawler. It’s a Code of Conduct sign at the pool: a polite bot reads it and behaves, but nothing forces it to.

A WAF (web application firewall) enforces. A blocked request typically gets a 403 at the edge, before it ever reaches your site, and the firewall never even consults robots.txt. They live at different layers: robots.txt requests; the firewall acts. If both exist, the firewall wins by construction.

Which means: robots.txt only governs the bots that choose to obey it. The ones you’d most want to stop (scrapers that spoof their user-agent, rotate IP addresses, and publish no verifiable ranges) ignore it entirely. For those, your only real tool is a firewall rule plus IP blocking. And because a user-agent string is trivially faked, pair any rule you actually care about with the vendor’s published IP ranges (OpenAI, Anthropic, Perplexity all publish them now) rather than UA-matching alone: a request that claims to be GPTBot but arrives from outside OpenAI’s published ranges is an impostor, whatever its user-agent says.
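To make the UA-plus-IP pairing concrete, here is a small Python sketch of the check a firewall rule would perform, using the standard ipaddress module. The CIDR blocks are placeholders from the RFC 5737 documentation ranges, not real vendor ranges, and the names PUBLISHED_RANGES and verify_claimed_bot are hypothetical. In practice you would load the ranges from each vendor’s published file and run the check at the WAF/CDN rather than in application code.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), not real vendor data.
# Assumption: you refresh these from each vendor's published IP-range file.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def verify_claimed_bot(user_agent: str, remote_ip: str,
                       ranges: dict = PUBLISHED_RANGES) -> str:
    """Classify a request as 'verified', 'spoofed', or 'unknown'.

    'verified': UA names a known bot and the IP is inside its ranges.
    'spoofed':  UA names a known bot but the IP is outside its ranges.
    'unknown':  UA names no bot we hold ranges for.
    """
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in ranges.items():
        if token.lower() in ua:
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return "verified" if in_range else "spoofed"
    return "unknown"


if __name__ == "__main__":
    # Illustrative user-agent strings, not copied from vendor docs.
    print(verify_claimed_bot("GPTBot/1.0", "192.0.2.10"))
    print(verify_claimed_bot("GPTBot/1.0", "203.0.113.5"))
    print(verify_claimed_bot("Mozilla/5.0 Chrome", "203.0.113.5"))
```

A “spoofed” result is the case worth a hard block. “Unknown” covers both ordinary browsers and scrapers hiding behind a browser identity, which is why this check complements rather than replaces rate limiting and IP blocking.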

The cautionary tale

In August 2025, Cloudflare reported catching Perplexity crawling sites that had explicitly blocked it — rotating user-agents and networks, using a generic “Chrome on macOS” identity to slip past. Cloudflare de-listed it as a verified bot; Perplexity denied it. Whatever the merits, the lesson is clean: a polite directive only works on the polite. Enforcement is a firewall problem.

A current cheat-sheet (mid-2026 — expect drift)

Exact spelling matters for robots.txt rules. Re-check vendor docs before deploying.

Training (usually block): GPTBot (OpenAI), ClaudeBot (Anthropic — now publishes IP ranges), CCBot (Common Crawl), Amazonbot, Meta-ExternalAgent. Plus two opt-out tokens that aren’t crawlers — Google-Extended and Applebot-Extended make no requests; they only tell Google/Apple not to use what they already fetched for training.
Retrieval / search (usually allow): OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot.
User-fetch (never block): ChatGPT-User, Claude-User, Perplexity-User (note: Perplexity-User ignores robots.txt by design, because it’s user-triggered).
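One way to keep the policy per-category rather than per-bot is to store the cheat-sheet as data and derive the action from the category. The Python sketch below does that with the tokens listed above. The names Category, TOKEN_CATEGORY, DEFAULT_ACTION, and default_action are illustrative, and the table is a mid-2026 snapshot that will drift. Google-Extended and Applebot-Extended are left out on purpose: they make no requests, so they never appear in a user-agent header and belong only in robots.txt.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval/search"
    USER_FETCH = "user-fetch"


# Snapshot of the tokens listed in this article; verify against vendor docs.
TOKEN_CATEGORY = {
    "GPTBot": Category.TRAINING,
    "ClaudeBot": Category.TRAINING,
    "CCBot": Category.TRAINING,
    "Amazonbot": Category.TRAINING,
    "Meta-ExternalAgent": Category.TRAINING,
    "OAI-SearchBot": Category.RETRIEVAL,
    "Claude-SearchBot": Category.RETRIEVAL,
    "PerplexityBot": Category.RETRIEVAL,
    "ChatGPT-User": Category.USER_FETCH,
    "Claude-User": Category.USER_FETCH,
    "Perplexity-User": Category.USER_FETCH,
}

# The article's default stance: block training, allow the other two.
DEFAULT_ACTION = {
    Category.TRAINING: "block",
    Category.RETRIEVAL: "allow",
    Category.USER_FETCH: "allow",
}


def classify(user_agent: str) -> Optional[Category]:
    """Map a user-agent string to a category, or None if no token matches."""
    ua = user_agent.lower()
    for token, category in TOKEN_CATEGORY.items():
        if token.lower() in ua:
            return category
    return None


def default_action(user_agent: str) -> str:
    """Return 'block' or 'allow'; unrecognized agents are allowed here."""
    category = classify(user_agent)
    return DEFAULT_ACTION[category] if category else "allow"
```

Changing your stance on a whole category then means editing one line in DEFAULT_ACTION instead of touching every token.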

The bigger shift: access control is moving to the edge

The whole question is migrating off your robots.txt and onto your infrastructure. In July 2025, Cloudflare began blocking AI crawlers by default for its customers and launched pay-per-crawl — using the old HTTP 402 Payment Required code to let sites charge bots for access. With Cloudflare fronting a large share of the web, the real control point increasingly sits at the CDN, not in a text file in your repo. And llms.txt, the proposed “map your site for LLMs” standard, is advisory too — useful for helping cooperative AI understand you, useless as a lock.

What to do now

Decide your stance per category, not per bot: block training, allow retrieval + user-fetch.
Put the policy in robots.txt for the well-behaved majority — it’s still worth doing.
Enforce the cases you actually care about at the firewall/CDN, verified by published IP ranges, not user-agent strings.
Never block user-fetch agents — you’d be blocking customers.
Check what your CDN already does by default (Cloudflare may be blocking or monetizing AI access for you).

Honest caveat

User-agent tokens, IP-range publication, and platform defaults change frequently — this snapshot is June 2026. What won’t change is the principle: robots.txt is a request, a firewall is enforcement, and the smartest policy is per-category, not all-or-nothing.

Key takeaways

Three bot types, three deals: training (block — nothing back), retrieval/search (allow — citations + clicks), user-fetch (never block — it’s a visitor).
robots.txt only asks; a WAF enforces. robots.txt governs polite bots only; spoofing scrapers ignore it.
UA strings are spoofable — enforce with vendor-published IP ranges for anything that matters.
Google-Extended / Applebot-Extended are opt-out signals, not crawlers.
Access control is moving to the edge: Cloudflare default-blocks AI bots and offers HTTP-402 pay-per-crawl; llms.txt is advisory, not a lock.

robots-txt-vs-waf-ai-bots — The enforcement deep-dive: why robots.txt only asks while a firewall actually blocks
seo/ai-crawler-access — The full reference: taxonomy, enforcement ordering, and the current user-agent tables
seo/ai-visibility — The flip side: getting found by the bots you allow (you can’t be cited by a crawler you’ve blocked)
tools/ai-visibility-audit — An audit that catches the WAF/CDN blocks invisible to standard SEO tools
seo/zero-click-strategy — Why allowing retrieval bots matters in a citation-first search world

Sources

seo/ai-crawler-access — internal synthesis with the full taxonomy, enforcement detail, and UA tables
OpenAI — Bots / Crawlers documentation — GPTBot, OAI-SearchBot, ChatGPT-User + IP-range files
Anthropic — Does Anthropic crawl the web, and how to block it
Cloudflare — Block AI crawlers by default + pay-per-crawl (Jul 2025)
Malwarebytes — Perplexity ignores no-crawl rules (Aug 2025)