AI Crawler Access Control: The Complete Guide (2026)

By Andrej Ruckij · June 17, 2026

TL;DR: AI crawler access control is deciding which AI bots may read your site, and enforcing that decision. The whole field reduces to one rule: classify each bot as training (block to opt out — it gives nothing back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). The one judgment call is whether to block training, and that hinges on whether your content is your product or your marketing. And the load-bearing technical truth: robots.txt only asks — a firewall enforces.

What you’ll learn

What AI crawler access control is and why it matters in 2026
The three bot types that drive every decision
How to decide block vs allow for your business
How to set it up in robots.txt — and why that isn’t enough on its own
Where llms.txt, costs, and regulation fit
A step-by-step checklist to do it right

This is the front-door guide; each section links to a deeper article. For the concise wiki reference, see seo/ai-crawler-access.

What AI crawler access control is

AI crawlers are bots that fetch your pages for AI systems: to train models, to build citation indexes for AI search, or to fetch a page a user asked about. Access control is the practice of deciding which of those you permit, and making the decision stick. It matters more every quarter: AI answers are becoming a primary discovery channel, AI training is a contested use of your content, and, since Cloudflare began blocking AI crawlers by default for new domains in July 2025, the decision is increasingly being made for you at the infrastructure layer if you don’t make it yourself.

The three bot types (the foundation)

Every decision flows from one taxonomy (glossary/ai-crawler, mechanics in how-ai-crawlers-work):

Type	What it does	Gives back?	Default
Training	Builds model weights from your content	Nothing	Block to opt out
Retrieval / search	Indexes your pages so AI answers can cite you	Citations + traffic	Allow
User-fetch	Opens a page a real user asked about	A visitor	Never block

“Block all AI” is a mistake because it collapses three very different transactions into one. The per-type policy is laid out in which-ai-bots-to-block, and every bot’s token is in the crawler directory.
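The taxonomy can be sketched as a small lookup from robots.txt token to bot type. This is illustrative only: the mapping covers just the tokens named in this guide, plus ChatGPT-User (OpenAI's user-initiated fetcher, which this guide does not mention and which is included here as an assumed example of the user-fetch type). The names BotType, KNOWN_BOTS, classify and default_action are hypothetical, and the crawler directory remains the authoritative list.

```python
from enum import Enum
from typing import Optional


class BotType(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    USER_FETCH = "user-fetch"


# Default action per type, taken from the table above.
DEFAULT_ACTION = {
    BotType.TRAINING: "block to opt out",
    BotType.RETRIEVAL: "allow",
    BotType.USER_FETCH: "never block",
}

# Illustrative, incomplete mapping of robots.txt tokens to bot types.
KNOWN_BOTS = {
    "GPTBot": BotType.TRAINING,
    "ClaudeBot": BotType.TRAINING,
    "CCBot": BotType.TRAINING,
    "Google-Extended": BotType.TRAINING,  # an opt-out token, not a crawler
    "OAI-SearchBot": BotType.RETRIEVAL,
    "PerplexityBot": BotType.RETRIEVAL,
    "ChatGPT-User": BotType.USER_FETCH,  # assumption: not named in this guide
}

# Case-insensitive index over the same mapping.
_LOOKUP = {name.lower(): bot_type for name, bot_type in KNOWN_BOTS.items()}


def classify(token: str) -> Optional[BotType]:
    """Return the bot type for a robots.txt token, or None if it is unknown."""
    return _LOOKUP.get(token.strip().lower())


def default_action(token: str) -> str:
    """Return the default action for a token; unknown bots need a manual review."""
    bot_type = classify(token)
    if bot_type is None:
        return "unclassified: review manually"
    return DEFAULT_ACTION[bot_type]
```

Returning None for unknown tokens is deliberate: a bot that has not been classified should trigger a review rather than silently inherit a block or an allow.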

The decision: block or allow?

Two of the three types are easy (always allow search and user-fetch). The entire decision reduces to whether to block training, and that turns on one question: is your content your product or your marketing?

Content is your product (publishers, paywalled research) → lean block; training cannibalizes you. This is the publisher logic.
Content is your marketing (most SaaS, ecommerce, services) → lean allow; AI visibility is an asset and over-blocking costs you discovery (what-you-lose-blocking-ai-search-bots).

The full framework, including the two opposing costs, is in block-or-allow-ai-crawlers.
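Reduced to code, the decision has a single input. The sketch below is a simplification under stated assumptions: the function name site_policy and the boolean content_is_product are hypothetical, and the guide's "lean block" and "lean allow" are treated as hard defaults, whereas in practice they are starting points to be weighed against the two opposing costs.

```python
from typing import Dict


def site_policy(content_is_product: bool) -> Dict[str, str]:
    """Map each bot type to an action, given the one judgment call.

    content_is_product=True  -> publishers, paywalled research (lean block)
    content_is_product=False -> most SaaS, ecommerce, services (lean allow)
    """
    return {
        # The only type whose action depends on the business model.
        "training": "block" if content_is_product else "allow",
        # Always allowed: retrieval bots cite you.
        "retrieval": "allow",
        # Never blocked: a user-fetch is a visitor.
        "user-fetch": "allow",
    }
```

Only the training entry ever changes between the two calls, which is the point of the framework: the other two types are not really decisions.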

Setting it up in robots.txt

A typical policy for a marketing site that has chosen to opt out of training while staying citable (full tokens: ai-crawler-user-agents-directory):

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Per-vendor specifics differ — OpenAI (openai-crawlers), Anthropic (anthropic-crawlers), Google (google-ai-crawlers, where Google-Extended is an opt-out token, not a crawler), Perplexity (perplexity-crawlers), and Meta/Amazon (meta-amazon-ai-crawlers).
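One way to sanity-check a policy like the one above before deploying it is Python's standard-library urllib.robotparser. The sketch below parses the rules from a string and reports which tokens may fetch a URL. The helper name check_policy and the example.com URL are assumptions for illustration. It shows only how a compliant parser reads the file; it says nothing about whether a given bot actually obeys it.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def check_policy(robots_txt: str, url: str, tokens: List[str]) -> Dict[str, bool]:
    """Return {token: allowed} for each token, as a compliant parser sees it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    results = check_policy(
        ROBOTS_TXT,
        "https://example.com/pricing",  # hypothetical URL
        ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"],
    )
    for token, allowed in results.items():
        print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

With this policy the training tokens come back blocked and the two search bots allowed. A token with no matching group and no wildcard group is allowed by default, which is why user-fetch agents need no explicit entry here.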

Why robots.txt isn’t enough

The single most important technical fact: robots.txt only asks; it doesn’t enforce. It governs bots that choose to comply. A managed firewall (WAF) rule, by contrast, blocks at the edge and overrides robots.txt — so a “block all AI” CDN rule will keep out the search bots you tried to allow. And non-compliant scrapers (the Bytespider type) ignore robots.txt entirely. The full treatment is in robots-txt-vs-waf-ai-bots; the short version is can-robots-txt-stop-ai-scrapers. Practical consequences:

Reconcile your CDN/WAF with your robots.txt, or the firewall silently wins.
Verify each bot’s identity against the vendor’s published IP ranges, since user-agent strings are spoofable.
Verify you’re reachable by the search bots you allowed (tools/ai-visibility-audit).
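The IP-range check can be sketched with Python's standard ipaddress module. Everything specific here is an assumption: the CIDR ranges are reserved documentation ranges used as placeholders (real ranges must come from each vendor's published list), the user-agent string in the usage note is illustrative, and the helper names ip_in_ranges and is_verified_bot are hypothetical.

```python
import ipaddress
from typing import Iterable

# Placeholder documentation ranges (RFC 5737 / RFC 3849), NOT real vendor ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if the address falls inside any of the CIDR ranges."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def is_verified_bot(user_agent: str, ip: str, token: str,
                    cidr_ranges: Iterable[str]) -> bool:
    """Trust a claimed bot only if both the token and the source IP match."""
    claims_token = token.lower() in user_agent.lower()
    return claims_token and ip_in_ranges(ip, cidr_ranges)
```

For example, is_verified_bot("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.5", "GPTBot", PLACEHOLDER_RANGES) returns False: the user-agent claims the token, but the source address is outside the listed ranges, which is the spoofing case the check exists to catch.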

Where llms.txt fits

llms.txt is a curated markdown map of your site for AI — a comprehension aid, not access control, and inconsistently adopted (llms-txt-complete-guide, honest assessment in does-llms-txt-work). Add it as a cheap “why not”; don’t mistake it for a control or a ranking lever.

Costs and regulation

Two further dimensions round out the picture:

Cost: AI crawling consumes bandwidth and server load, and training crawlers give little back (the crawl-to-referral asymmetry). Measure before acting — ai-crawler-traffic-impact.
Regulation: in the EU, the AI Act is making machine-readable opt-outs (robots.txt) legally meaningful for training; the UK dropped its opt-out plan in March 2026. See ai-crawler-regulation-eu-uk.
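For the cost side, "measure before acting" can start with a tally of requests and bytes per bot from the server's access log. The sketch below assumes the common Combined Log Format (request, status, bytes, referer, user-agent as the trailing fields); the token list, the function name crawl_cost and the regular expression are illustrative assumptions, and a different log format would need a different pattern. Google-Extended is omitted because it is an opt-out token and does not appear in user-agent strings.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Trailing fields of Combined Log Format:
#   "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"\s*$'
)

# Illustrative subset of crawler tokens that appear in user-agent strings.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot")


def crawl_cost(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Return {token: (request_count, bytes_sent)} for the known AI bot tokens."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in agent:
                sent = match.group("bytes")
                totals[token][0] += 1
                totals[token][1] += 0 if sent == "-" else int(sent)
                break  # count each request against one token only
    return {token: (count, sent) for token, (count, sent) in totals.items()}
```

Because this matches on the user-agent string alone, spoofed requests are counted too; it measures claimed bot traffic, which is usually enough for a first estimate of load.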

Do it right: the checklist

The eight-step version — audit, decide, write robots.txt, reconcile CDN/WAF, verify reachability, firewall the scrapers, optionally add llms.txt, monitor quarterly — is in ai-crawler-access-checklist. Work it top to bottom.

Common questions

Q: Will blocking GPTBot remove me from ChatGPT? A: No. GPTBot is the training crawler; ChatGPT search citations come from OAI-SearchBot, a separate bot. See gptbot-vs-oai-searchbot.
Q: Does blocking AI bots hurt my Google SEO? A: No — AI training bots are separate from Googlebot. See does-blocking-ai-bots-hurt-seo.
Q: Is Cloudflare already deciding this for me? A: Possibly. Cloudflare has blocked AI crawlers by default since July 2025. See does-cloudflare-block-ai-crawlers.
Q: Do AI crawlers respect robots.txt? A: Reputable ones do; many don’t. See do-ai-crawlers-respect-robots-txt.

Key takeaways

Classify every bot as training (block), retrieval (allow), or user-fetch (never block).
The only real decision is training — and it hinges on content-as-product vs content-as-marketing.
robots.txt asks; a WAF enforces and overrides it — reconcile your layers and verify by IP range.
llms.txt is a comprehension aid, not access control; costs and EU regulation add further weight to a deliberate policy.
Use the checklist, then refresh quarterly — tokens and CDN defaults drift.

Related articles

which-ai-bots-to-block — the per-category policy
block-or-allow-ai-crawlers — the decision framework
ai-crawler-user-agents-directory — every bot’s token + recommendation
robots-txt-vs-waf-ai-bots — why a firewall enforces where robots.txt asks
how-to-block-ai-scrapers — the full enforcement stack for non-compliant scrapers (WAF, IP/ASN, tarpits)
how-ai-crawlers-work · ai-crawler-traffic-impact · ai-crawler-access-checklist — mechanics, cost, and the do-it list
llms-txt-complete-guide — the comprehension-aid file
publishers-blocking-ai · ai-crawler-regulation-eu-uk — landscape and law
seo/ai-crawler-access — the concise wiki reference
glossary/ai-crawler — the foundational definition

Sources

seo/ai-crawler-access — internal synthesis (taxonomy, enforcement, UA tables)
OpenAI — Bots / Crawlers documentation
Cloudflare — Block AI crawlers by default + pay-per-crawl (Jul 2025)