Why robots.txt Won’t Block AI Bots (and What Actually Does)

By Andrej Ruckij · June 16, 2026

TL;DR: robots.txt does not block anything — it asks well-behaved bots to stay away, and the badly-behaved ones ignore it. A WAF (web application firewall) enforces: it returns a 403 at the edge before the request reaches your site, and it never consults robots.txt. If a managed firewall rule and your robots.txt disagree, the firewall wins. So the real setup is layered: robots.txt for the compliant majority, a firewall for the cases you actually need to stop — backed by IP-range verification, because user-agent strings are trivially faked.

Most “how to block AI bots” guides hand you a robots.txt snippet and call it done. That snippet is worth adding, but if you believe it’s keeping AI crawlers out, you’ve mistaken a sign for a lock. This is the enforcement side of AI crawler access control, and it’s the part nearly every guide skips. Understanding it is the difference between “I told the bots not to” and “the bots can’t.”

robots.txt is a request, not a barrier

robots.txt is a voluntary standard. The spec that defines it (IETF RFC 9309) is explicit that its rules “are not a form of access authorization.” It’s a Code of Conduct sign posted at a community pool: a polite swimmer reads it and behaves; nothing physically stops anyone who doesn’t.

In practice that means robots.txt governs exactly one population: the bots that choose to obey it. Reputable operators (OpenAI, Anthropic, Google) do honor it. So a User-agent: GPTBot group followed by Disallow: / genuinely stops GPTBot, because OpenAI’s crawler reads the file and complies.

But two things break the illusion of control:

Non-compliant scrapers ignore it entirely. Anything that spoofs its user-agent, rotates IP addresses, and publishes no verifiable range will read your Disallow and keep going — or never request the file at all.
robots.txt has no teeth even against compliant bots that misread intent. It’s a parsing convention, not an enforcement mechanism. Nothing rejects the request; the bot self-polices.

If your goal is “fewer of my words in training sets, from companies that play fair,” robots.txt is the right tool. If your goal is “keep this content out, full stop,” it can’t deliver that on its own.
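To make the self-policing concrete, here is a minimal sketch of the check a compliant crawler runs before fetching a page, using Python’s standard urllib.robotparser. The robots.txt content, the example.com URL, and the may_fetch helper are illustrative assumptions. The point to notice is where the check runs: entirely inside the client. A scraper that never calls it is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a compliant crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    for agent in ("GPTBot", "OAI-SearchBot", "SomeScraper"):
        # An agent that no group names is allowed by default.
        print(agent, may_fetch(agent, url))
```

Nothing on the server side took part in that decision, which is why the file only works on bots that choose to run the equivalent of may_fetch.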

A WAF enforces — and it wins

A web application firewall sits in front of your origin and inspects requests before they reach it. A request that matches a block rule gets a 403 Forbidden (or is silently dropped) at the edge. The firewall makes that decision on its own — it never reads your robots.txt.

That leads to the single most important, and least understood, fact in AI bot access control:

If a managed WAF rule blocks “all AI crawlers” while your robots.txt says Allow: OAI-SearchBot, the WAF wins. The firewall acts first, at the network edge; robots.txt is just a file a bot may or may not fetch afterward. You will be excluded from ChatGPT’s search even though your robots.txt invited it in.

People get tripped up trying to imagine the two being “evaluated in order,” as if one config is parsed before another. They’re not in the same system at all. robots.txt lives in your site and depends on the bot’s cooperation. The WAF lives at the edge and depends on nothing. When both exist, the firewall result is the one that happens — by construction.

The practical takeaway: reconcile your layers. If you (or your CDN) have a managed “block AI bots” rule switched on, check it against your robots.txt intentions before assuming your search crawlers are getting through. A lot of accidental AI-invisibility in 2026 comes from a default firewall rule quietly overriding a carefully written robots.txt.
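One way to do that reconciliation is a small audit script. The sketch below is hypothetical: find_conflicts, the token list standing in for a managed firewall rule, and the substring matching are all assumptions, since a real WAF ruleset is configured in your CDN’s dashboard or API and may match differently. It lists the agents your robots.txt invites in but the edge would reject.

```python
from urllib.robotparser import RobotFileParser


def find_conflicts(robots_txt, edge_blocked_tokens, agents_to_check,
                   probe_url="https://example.com/"):
    """List agents that robots.txt permits but the edge rule would 403."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [token.lower() for token in edge_blocked_tokens]

    conflicts = []
    for agent in agents_to_check:
        invited = parser.can_fetch(agent, probe_url)
        # Assumption: the edge rule matches on a user-agent substring.
        stopped_at_edge = any(token in agent.lower() for token in blocked)
        if invited and stopped_at_edge:
            conflicts.append(agent)
    return conflicts


ROBOTS = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: OAI-SearchBot\nAllow: /\n"
EDGE_RULE = ["GPTBot", "OAI-SearchBot", "CCBot"]  # hypothetical "block AI bots" rule

if __name__ == "__main__":
    print(find_conflicts(ROBOTS, EDGE_RULE,
                         ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]))
```

Any agent the function returns is one that the robots.txt Allow will never help, because the request is rejected before the bot’s cooperation matters.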

What this looks like in practice

A real, layered setup for a site that wants to block training but stay visible in AI search:

# robots.txt — governs the compliant majority
User-agent: GPTBot          # OpenAI training
Disallow: /
User-agent: CCBot           # Common Crawl (feeds many models)
Disallow: /
User-agent: Google-Extended  # opt-out token: no Gemini/Vertex training use
Disallow: /

User-agent: OAI-SearchBot   # ChatGPT search — allow (drives citations)
Allow: /
User-agent: ChatGPT-User    # fetch triggered by a ChatGPT user's request: never block
Allow: /

That file handles OpenAI, Common Crawl, and Google’s training opt-out — all compliant. It does nothing about a scraper rotating user-agents from a residential IP pool. For that you need a firewall rule (or a managed AI-bot ruleset at your CDN) plus IP/ASN blocking. The two layers do different jobs: robots.txt expresses intent to the polite; the WAF imposes reality on the rest.

The cautionary tale: when a polite directive isn’t honored

In August 2025, Cloudflare reported catching Perplexity crawling sites that had explicitly disallowed it — rotating user-agents and source networks, and using a generic “Chrome on macOS” identity to slip past blocks tied to its declared crawler. Cloudflare de-listed Perplexity from its verified-bots list and added heuristics to catch the behavior; Perplexity disputed the findings.

Whatever the merits of that specific dispute, the lesson is structural and not about any one company: a directive only works on those who choose to follow it. The moment a crawler decides not to, your robots.txt is a suggestion it has declined. The only response that works is enforcement at the edge.

User-agent strings are not identity

There’s a second trap inside the firewall layer: blocking by user-agent string alone is weak, because a UA is just a header any client can set. A scraper can call itself Googlebot in one request and Mozilla/5.0 in the next.

So for anything you genuinely need to control, verify the bot, don’t trust its name:

The major operators now publish IP-range files (OpenAI, Anthropic, and Perplexity all do). A request claiming to be GPTBot from outside OpenAI’s published ranges is not GPTBot.
Reverse-DNS verification and ASN-level rules catch what UA-matching misses.
Behavioral heuristics (request velocity, path patterns) flag the stealth crawlers that pass every static check.
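The IP-range check in the first item can be sketched with Python’s standard ipaddress module. The ranges below are placeholders taken from the reserved documentation blocks, not any operator’s real ranges, and is_verified_bot and PUBLISHED_RANGES are hypothetical names. In practice you would load the ranges from the file the operator publishes and refresh them regularly.

```python
import ipaddress

# Placeholder CIDRs from the reserved documentation blocks (RFC 5737, RFC 3849).
# These are NOT real operator ranges; load those from the operator's published file.
PUBLISHED_RANGES = {
    "gptbot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified_bot(user_agent: str, remote_ip: str,
                    published=PUBLISHED_RANGES) -> bool:
    """True only if the UA claims a known bot AND the IP is in that bot's ranges."""
    ua = user_agent.lower()
    addr = ipaddress.ip_address(remote_ip)
    for bot_token, cidrs in published.items():
        if bot_token in ua:
            # Membership is False when the address and network versions differ.
            return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # the UA does not claim to be a bot we can verify


if __name__ == "__main__":
    claimed = "Mozilla/5.0 (compatible; GPTBot/1.1)"
    print(is_verified_bot(claimed, "192.0.2.10"))    # inside the placeholder range
    print(is_verified_bot(claimed, "198.51.100.7"))  # outside it: treat as an impostor
```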

UA-matching in robots.txt or a simple firewall rule is fine for the honest bots — they identify themselves truthfully. It’s the dishonest ones you’re trying to stop, and those are exactly the ones a name-based rule can’t catch.
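For the behavioral side, a request-velocity heuristic can be as simple as a sliding-window counter. This is a minimal sketch under stated assumptions: the VelocityFlagger class, its thresholds, and keying on client IP are all illustrative. A scraper rotating through a residential IP pool will spread its requests across keys, so production systems usually combine this with other signals.

```python
from collections import defaultdict, deque


class VelocityFlagger:
    """Flag clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent request timestamps

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one request; return True if the client now exceeds the limit."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    flagger = VelocityFlagger(max_requests=3, window_seconds=1.0)
    for i in range(5):
        print(i, flagger.record("198.51.100.7", i * 0.1))
```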

Where access control is actually heading

The control point is migrating off your site entirely. In July 2025, Cloudflare began blocking AI crawlers by default for its customers and introduced pay-per-crawl — reviving the dormant HTTP 402 Payment Required status code so sites can charge bots for access via crawler-price headers. With a large share of the web sitting behind Cloudflare, “should I block this bot?” is increasingly a setting at the CDN, not a line in a repo.
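From the crawler’s side, handling a 402 might look roughly like the sketch below. Only the crawler-price header name comes from the text above. The value format ("USD 0.01"), the should_pay helper, and the budget logic are assumptions for illustration, so check Cloudflare’s pay-per-crawl documentation for the actual header semantics before relying on any of it.

```python
from decimal import Decimal


def parse_price(header_value: str) -> Decimal:
    """Parse a price assumed to look like 'USD 0.01' (format is an assumption)."""
    currency, amount = header_value.split()
    if currency.upper() != "USD":
        raise ValueError(f"unsupported currency: {currency}")
    return Decimal(amount)


def should_pay(status: int, headers: dict, budget_usd: Decimal) -> bool:
    """Decide whether a crawler would accept the quoted price for this URL."""
    if status != 402:
        return False  # nothing to pay for
    lowered = {name.lower(): value for name, value in headers.items()}
    quoted = lowered.get("crawler-price")
    if quoted is None:
        return False  # a 402 without a quoted price: skip the page
    return parse_price(quoted) <= budget_usd


if __name__ == "__main__":
    response_headers = {"Crawler-Price": "USD 0.01"}
    print(should_pay(402, response_headers, Decimal("0.05")))
```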

And to be clear about the file people reach for next: llms.txt is advisory too. It’s a useful way to help cooperative AI understand your site’s structure, but it carries the same honor-system limitation as robots.txt and is not consistently honored. It is not access control. (For the full taxonomy and current user-agent tables, see seo/ai-crawler-access.)

Key takeaways

robots.txt asks; a WAF enforces. The file governs only bots that choose to comply.
The firewall wins. A managed “block AI bots” rule overrides any Allow in robots.txt — reconcile the two before assuming your search crawlers get through.
Non-compliant scrapers ignore robots.txt; the Perplexity stealth-crawl case reported by Cloudflare (Aug 2025) is the textbook example. Enforcement is a firewall problem.
User-agent strings are spoofable — verify with published IP ranges, reverse DNS, and behavioral signals, not the bot’s self-reported name.
Use both layers on purpose: robots.txt for the compliant majority, WAF + IP rules for the cases you must stop. And llms.txt is advisory, not a lock.

which-ai-bots-to-block — the companion: which AI bots to allow vs block, by category
seo/ai-crawler-access — the full reference: bot taxonomy, enforcement ordering, current user-agent tables
tools/ai-visibility-audit — an audit whose UA-spoofed fetches catch exactly the WAF/CDN hard-blocks described here
seo/ai-visibility — the other side: being found by the bots you allow

Sources

seo/ai-crawler-access — internal synthesis with the full taxonomy and enforcement detail
Cloudflare — Control content use for AI training — robots.txt-is-voluntary vs WAF-enforces
Cloudflare — Block AI crawlers by default + pay-per-crawl (Jul 2025)
Malwarebytes — Perplexity ignores no-crawl rules (Aug 2025)
IETF RFC 9309 — Robots Exclusion Protocol — “compliance is not required”