AI Crawler Access Control: The Complete Guide (2026)
Everything site owners need on AI crawler access: the training/retrieval/user-fetch taxonomy, the block-or-allow decision, how to set up robots.txt, why a WAF enforces where robots.txt asks, llms.txt, costs, and regulation.
By Andrej Ruckij · June 17, 2026
TL;DR: AI crawler access control is deciding which AI bots may read your site, and enforcing that decision. The whole field reduces to one rule: classify each bot as training (block to opt out — it gives nothing back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). The one judgment call is whether to block training, and that hinges on whether your content is your product or your marketing. And the load-bearing technical truth: robots.txt only asks — a firewall enforces.
What you’ll learn
- What AI crawler access control is and why it matters in 2026
- The three bot types that drive every decision
- How to decide block vs allow for your business
- How to set it up in robots.txt — and why that isn’t enough on its own
- Where llms.txt, costs, and regulation fit
- A step-by-step checklist to do it right
This is the front-door guide; each section links to a deeper article. For the concise wiki reference, see seo/ai-crawler-access.
What AI crawler access control is
AI crawlers are bots that fetch your pages for AI systems — to train models, to build citation indexes for AI search, or to fetch a page a user asked about. Access control is the practice of deciding which of those you permit, and making the decision stick. It matters more every quarter: AI answers are becoming a primary discovery channel, AI training is a contested use of your content, and — since Cloudflare began blocking AI crawlers by default in July 2025 — the decision is increasingly being made for you at the infrastructure layer if you don’t make it yourself.
The three bot types (the foundation)
Every decision flows from one taxonomy (glossary/ai-crawler, mechanics in how-ai-crawlers-work):
| Type | What it does | Gives back? | Default |
|---|---|---|---|
| Training | Builds model weights from your content | Nothing | Block to opt out |
| Retrieval / search | Cites you in AI answers | Citations + traffic | Allow |
| User-fetch | Opens a page a real user asked about | A visitor | Never block |
“Block all AI” is a mistake because it collapses three very different transactions into one. The per-type policy is laid out in which-ai-bots-to-block, and every bot’s token is in the crawler directory.
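As a rough illustration of the taxonomy, the sketch below maps a request's user-agent string to one of the three types and to the table's default action. The training and retrieval tokens are the ones used in this guide's robots.txt example. The two user-fetch tokens (`ChatGPT-User`, `Perplexity-User`) and the `review` fallback for unrecognized agents are assumptions added for the example, so check the crawler directory before relying on any of them.

```python
from enum import Enum


class BotType(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    USER_FETCH = "user-fetch"
    UNKNOWN = "unknown"


# Lowercased tokens mapped to a type. Google-Extended is left out on purpose:
# it is a robots.txt opt-out token, not something sent in a User-Agent header.
# The user-fetch tokens are assumptions for this sketch, not taken from the guide.
BOT_TYPES = {
    "gptbot": BotType.TRAINING,
    "claudebot": BotType.TRAINING,
    "ccbot": BotType.TRAINING,
    "oai-searchbot": BotType.RETRIEVAL,
    "perplexitybot": BotType.RETRIEVAL,
    "chatgpt-user": BotType.USER_FETCH,
    "perplexity-user": BotType.USER_FETCH,
}

# Defaults from the table above; "review" for unknown agents is an assumption.
DEFAULT_ACTION = {
    BotType.TRAINING: "block",
    BotType.RETRIEVAL: "allow",
    BotType.USER_FETCH: "allow",
    BotType.UNKNOWN: "review",
}


def classify(user_agent: str) -> BotType:
    """Return the bot type whose token appears in the user-agent string."""
    ua = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        if token in ua:
            return bot_type
    return BotType.UNKNOWN


if __name__ == "__main__":
    # Illustrative user-agent strings, not copied from vendor documentation.
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.1)", "OAI-SearchBot/1.0", "curl/8.0"):
        bot_type = classify(ua)
        print(ua, "->", bot_type.value, "->", DEFAULT_ACTION[bot_type])
```

Substring matching on the user-agent is only a first pass, because the string is supplied by the client and can be forged.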
The decision: block or allow?
Two of the three types are easy (always allow search and user-fetch). The entire decision reduces to whether to block training, and that turns on one question: is your content your product or your marketing?
- Content is your product (publishers, paywalled research) → lean block; training cannibalizes you. This is the publisher logic.
- Content is your marketing (most SaaS, ecommerce, services) → lean allow; AI visibility is an asset and over-blocking costs you discovery (what-you-lose-blocking-ai-search-bots).
The full framework, including the two opposing costs, is in block-or-allow-ai-crawlers.
Setting it up in robots.txt
A typical marketing-site policy (full tokens: ai-crawler-user-agents-directory):
```
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
```
Per-vendor specifics differ — OpenAI (openai-crawlers), Anthropic (anthropic-crawlers), Google (google-ai-crawlers, where Google-Extended is an opt-out token, not a crawler), Perplexity (perplexity-crawlers), and Meta/Amazon (meta-amazon-ai-crawlers).
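One way to sanity-check a policy like the one above before deploying it is to run it through a robots.txt parser and ask what each token would be allowed to fetch. The sketch below uses Python's standard-library `urllib.robotparser`. The helper name `allowed_map` is made up for this example, and the result only shows how a compliant parser reads the file, not how any particular bot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The marketing-site policy from this section.
ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
"""


def allowed_map(robots_txt: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Parse robots.txt text and report whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"]
RESULT = allowed_map(ROBOTS_TXT, AGENTS)

if __name__ == "__main__":
    for agent, allowed in RESULT.items():
        print(f"{agent:16} {'allowed' if allowed else 'disallowed'}")
```

Running the same check against the live file after each edit is a cheap way to catch a typo in a token before a crawler does.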
Why robots.txt isn’t enough
The single most important technical fact: robots.txt only asks; it doesn’t enforce. It governs bots that choose to comply. A managed firewall (WAF) rule, by contrast, blocks at the edge and overrides robots.txt — so a “block all AI” CDN rule will keep out the search bots you tried to allow. And non-compliant scrapers (the Bytespider type) ignore robots.txt entirely. The full treatment is in robots-txt-vs-waf-ai-bots; the short version is can-robots-txt-stop-ai-scrapers. Practical consequences:
- Reconcile your CDN/WAF with your robots.txt, or the firewall silently wins.
- Verify by IP range, since user-agent strings are spoofable.
- Verify you’re reachable by the search bots you allowed (tools/ai-visibility-audit).
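The IP-range check in the list above can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not real crawler ranges. In practice they would be replaced with whatever each vendor currently publishes, and the `PUBLISHED_RANGES` structure and `is_verified` name are assumptions for this example.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real crawler IPs.
# Replace with the ranges each vendor actually publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified(bot_token: str, client_ip: str, published=PUBLISHED_RANGES) -> bool:
    """Return True only if client_ip falls inside a range published for bot_token."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed address: treat as unverified rather than raising.
        return False
    for cidr in published.get(bot_token, []):
        if ip in ipaddress.ip_network(cidr):
            return True
    return False


if __name__ == "__main__":
    # A request claiming to be OAI-SearchBot from inside and outside the range.
    print(is_verified("OAI-SearchBot", "192.0.2.10"))
    print(is_verified("OAI-SearchBot", "203.0.113.5"))
```

A request whose user-agent claims a known bot but whose address fails this check is best treated as an unknown client rather than as the bot it names.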
Where llms.txt fits
llms.txt is a curated markdown map of your site for AI — a comprehension aid, not access control, and inconsistently adopted (llms-txt-complete-guide, honest assessment in does-llms-txt-work). Add it as a cheap “why not”; don’t mistake it for a control or a ranking lever.
Costs and regulation
Two further dimensions round out the picture:
- Cost: AI crawling consumes bandwidth and server load, and training crawlers give little back (the crawl-to-referral asymmetry). Measure before acting — ai-crawler-traffic-impact.
- Regulation: in the EU, the AI Act is making machine-readable opt-outs (robots.txt) legally meaningful for training; the UK dropped its opt-out plan in March 2026. See ai-crawler-regulation-eu-uk.
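For the cost side, "measure before acting" can start with a tally of requests and bytes per AI bot from the server's access log. The sketch below assumes the common "combined" log format. The sample lines are invented for illustration, and the token list is limited to the crawlers named in this guide, so adapt both to your own logs.

```python
import re
from collections import defaultdict

# Matches the tail of a combined-format line:
# "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<ua>[^"]*)"'
)

# Crawler tokens named in this guide (Google-Extended is not a user-agent).
AI_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot")

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [17/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.11 - - [17/Jun/2026:10:00:01 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.7 - - [17/Jun/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "OAI-SearchBot/1.0"',
    '203.0.113.9 - - [17/Jun/2026:10:00:03 +0000] "GET / HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]


def tally(lines):
    """Sum requests and response bytes per AI crawler token."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                size = match.group("size")
                totals[token]["requests"] += 1
                totals[token]["bytes"] += 0 if size == "-" else int(size)
                break
    return dict(totals)


if __name__ == "__main__":
    for token, stats in tally(SAMPLE_LOG).items():
        print(f"{token:14} requests={stats['requests']} bytes={stats['bytes']}")
```

Comparing these totals with referral visits from the same vendors gives a first estimate of the crawl-to-referral asymmetry on your own site.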
Do it right: the checklist
The eight-step version — audit, decide, write robots.txt, reconcile CDN/WAF, verify reachability, firewall the scrapers, optionally add llms.txt, monitor quarterly — is in ai-crawler-access-checklist. Work it top to bottom.
Common questions
- Q: Will blocking GPTBot remove me from ChatGPT? A: No. ChatGPT search relies on OAI-SearchBot, a separate bot. See gptbot-vs-oai-searchbot.
- Q: Does blocking AI bots hurt my Google SEO? A: No. AI training bots are separate from Googlebot. See does-blocking-ai-bots-hurt-seo.
- Q: Is Cloudflare already deciding this for me? A: Possibly. It has blocked AI crawlers by default since July 2025. See does-cloudflare-block-ai-crawlers.
- Q: Do AI crawlers respect robots.txt? A: Reputable ones do; many don’t. See do-ai-crawlers-respect-robots-txt.
Key takeaways
- Classify every bot as training (block), retrieval (allow), or user-fetch (never block).
- The only real decision is training — and it hinges on content-as-product vs content-as-marketing.
- robots.txt asks; a WAF enforces and overrides it — reconcile your layers and verify by IP range.
- llms.txt is a comprehension aid, not access control; costs and EU regulation add further weight to a deliberate policy.
- Use the checklist, then refresh quarterly — tokens and CDN defaults drift.
Related articles
- which-ai-bots-to-block — the per-category policy
- block-or-allow-ai-crawlers — the decision framework
- ai-crawler-user-agents-directory — every bot’s token + recommendation
- robots-txt-vs-waf-ai-bots — why a firewall enforces where robots.txt asks
- how-to-block-ai-scrapers — the full enforcement stack for non-compliant scrapers (WAF, IP/ASN, tarpits)
- how-ai-crawlers-work · ai-crawler-traffic-impact · ai-crawler-access-checklist — mechanics, cost, and the do-it list
- llms-txt-complete-guide — the comprehension-aid file
- publishers-blocking-ai · ai-crawler-regulation-eu-uk — landscape and law
- seo/ai-crawler-access — the concise wiki reference
- glossary/ai-crawler — the foundational definition
Sources
- seo/ai-crawler-access — internal synthesis (taxonomy, enforcement, UA tables)
- OpenAI — Bots / Crawlers documentation
- Cloudflare — Block AI crawlers by default + pay-per-crawl (Jul 2025)