AI Crawler Access Control: The Complete Guide (2026)

Everything site owners need on AI crawler access: the training/retrieval/user-fetch taxonomy, the block-or-allow decision, how to set up robots.txt, why a WAF enforces where robots.txt asks, llms.txt, costs, and regulation.

By Andrej Ruckij · June 17, 2026 · 7 min read

TL;DR: AI crawler access control is deciding which AI bots may read your site, and enforcing that decision. The whole field reduces to one rule: classify each bot as training (block to opt out — it gives nothing back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). The one judgment call is whether to block training, and that hinges on whether your content is your product or your marketing. And the load-bearing technical truth: robots.txt only asks — a firewall enforces.

What you’ll learn

  • What AI crawler access control is and why it matters in 2026
  • The three bot types that drive every decision
  • How to decide block vs allow for your business
  • How to set it up in robots.txt — and why that isn’t enough on its own
  • Where llms.txt, costs, and regulation fit
  • A step-by-step checklist to do it right

This is the front-door guide; each section links to a deeper article. For the concise wiki reference, see seo/ai-crawler-access.

What AI crawler access control is

AI crawlers are bots that fetch your pages for AI systems — to train models, to build citation indexes for AI search, or to fetch a page a user asked about. Access control is the practice of deciding which of those you permit, and making the decision stick. It matters more every quarter: AI answers are becoming a primary discovery channel, AI training is a contested use of your content, and — since Cloudflare began blocking AI crawlers by default in July 2025 — the decision is increasingly being made for you at the infrastructure layer if you don’t make it yourself.

The three bot types (the foundation)

Every decision flows from one taxonomy (glossary/ai-crawler, mechanics in how-ai-crawlers-work):

Type | What it does | Gives back? | Default
Training | Builds model weights from your content | Nothing | Block to opt out
Retrieval / search | Cites you in AI answers | Citations + traffic | Allow
User-fetch | Opens a page a real user asked about | A visitor | Never block

“Block all AI” is a mistake because it collapses three very different transactions into one. The per-type policy is laid out in which-ai-bots-to-block, and every bot’s token is in the crawler directory.
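
As a rough illustration of the taxonomy above, the per-type defaults can be written as a small lookup. This is a minimal sketch, not a complete directory: the training and retrieval tokens are the ones this guide uses in its robots.txt example, ChatGPT-User is added here as an assumed user-fetch token, and the names BotType, BOT_TYPES, classify, and default_policy are hypothetical.

```python
from enum import Enum


class BotType(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    USER_FETCH = "user-fetch"
    UNKNOWN = "unknown"


# Token-to-type map (an assumption for illustration; check each vendor's docs).
# Google-Extended is a robots.txt opt-out token, so it will not normally
# appear in a request's User-Agent header; it is listed for completeness.
BOT_TYPES = {
    "gptbot": BotType.TRAINING,
    "claudebot": BotType.TRAINING,
    "ccbot": BotType.TRAINING,
    "google-extended": BotType.TRAINING,
    "oai-searchbot": BotType.RETRIEVAL,
    "perplexitybot": BotType.RETRIEVAL,
    "chatgpt-user": BotType.USER_FETCH,  # assumed user-fetch example
}

# Defaults taken from the table above; unknown bots get a manual review.
DEFAULT_POLICY = {
    BotType.TRAINING: "block to opt out",
    BotType.RETRIEVAL: "allow",
    BotType.USER_FETCH: "never block",
    BotType.UNKNOWN: "review manually",
}


def classify(user_agent: str) -> BotType:
    """Return the bot type for a user-agent string via substring match."""
    ua = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        if token in ua:
            return bot_type
    return BotType.UNKNOWN


def default_policy(user_agent: str) -> str:
    """Map a user-agent string to the default action for its type."""
    return DEFAULT_POLICY[classify(user_agent)]
```

Calling default_policy("PerplexityBot/1.0") returns "allow", while an unrecognized agent falls through to "review manually" rather than being blocked or allowed by accident.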

The decision: block or allow?

Two of the three types are easy (always allow search and user-fetch). The entire decision reduces to whether to block training, and that turns on one question: is your content your product or your marketing?

  • Content is your product (publishers, paywalled research) → lean block; training cannibalizes you. This is the publisher logic.
  • Content is your marketing (most SaaS, ecommerce, services) → lean allow; AI visibility is an asset and over-blocking costs you discovery (what-you-lose-blocking-ai-search-bots).

The full framework, including the two opposing costs, is in block-or-allow-ai-crawlers.

Setting it up in robots.txt

A typical marketing-site policy (full tokens: ai-crawler-user-agents-directory):

# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
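
Before deploying a policy like the one above, it can help to check that it parses the way you expect. The sketch below uses Python's standard urllib.robotparser for that; the helper name allowed and the example.com URL are placeholders, and the result only shows what a compliant parser would conclude, not what any given bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# The policy from this section, copied verbatim.
ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Allow the bots that cite you
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
"""


def allowed(robots_txt: str, token: str, url: str = "https://example.com/") -> bool:
    """Return True if a compliant bot using this token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    for token in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"):
        print(token, "allowed" if allowed(ROBOTS_TXT, token) else "blocked")
```

Because the file has no `User-agent: *` group, any token it does not name is allowed by default, which is worth keeping in mind when new crawlers appear.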

Per-vendor specifics differ — OpenAI (openai-crawlers), Anthropic (anthropic-crawlers), Google (google-ai-crawlers, where Google-Extended is an opt-out token, not a crawler), Perplexity (perplexity-crawlers), and Meta/Amazon (meta-amazon-ai-crawlers).

Why robots.txt isn’t enough

The single most important technical fact: robots.txt only asks; it doesn’t enforce. It governs only the bots that choose to comply. A managed firewall (WAF) rule, by contrast, blocks at the edge no matter what robots.txt says, so a “block all AI” CDN rule will keep out the search bots you tried to allow. Non-compliant scrapers (the Bytespider type) ignore robots.txt entirely. The full treatment is in robots-txt-vs-waf-ai-bots; the short version is can-robots-txt-stop-ai-scrapers. Practical consequences:

  • Reconcile your CDN/WAF with your robots.txt, or the firewall silently wins.
  • Verify bot identity against each vendor’s published IP ranges, since user-agent strings are trivially spoofed.
  • Verify you’re reachable by the search bots you allowed (tools/ai-visibility-audit).
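
The IP-range check in the list above might look like the following sketch, using Python's standard ipaddress module. The CIDR blocks are placeholders from the reserved documentation ranges, not real vendor ranges, and PUBLISHED_RANGES and is_verified are hypothetical names; in practice you would load each vendor's current published list.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), not real vendor IPs.
# Replace with the ranges each vendor actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "OAI-SearchBot": ["198.51.100.0/24"],
}


def is_verified(token: str, client_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True only if client_ip falls inside a range listed for token."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed address: treat as unverified.
        return False
    for cidr in ranges.get(token, []):
        network = ipaddress.ip_network(cidr)
        if ip.version == network.version and ip in network:
            return True
    return False
```

A request claiming to be GPTBot from an address outside the listed ranges returns False, which is the case a user-agent check alone cannot catch.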

Where llms.txt fits

llms.txt is a curated markdown map of your site for AI — a comprehension aid, not access control, and inconsistently adopted (llms-txt-complete-guide, honest assessment in does-llms-txt-work). Add it as a cheap “why not”; don’t mistake it for a control or a ranking lever.
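
To make the “curated markdown map” concrete, here is a minimal sketch that renders one, assuming the commonly described layout of the llms.txt proposal: a title heading, a one-line summary in a blockquote, then sections of annotated links. The function name render_llms_txt and the example.com pages are hypothetical.

```python
def render_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Render an llms.txt-style markdown map.

    sections maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        "Example Site",
        "Docs and pricing for a hypothetical product.",
        {
            "Docs": [
                ("Quickstart", "https://example.com/docs/quickstart", "Install and first run"),
            ],
            "Company": [
                ("Pricing", "https://example.com/pricing", "Plans and limits"),
            ],
        },
    ))
```

The output is only a reading aid for AI systems; nothing in it grants or denies access.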

Costs and regulation

Two further dimensions round out the picture:

  • Cost: AI crawling consumes bandwidth and server load, and training crawlers give little back (the crawl-to-referral asymmetry). Measure before acting — ai-crawler-traffic-impact.
  • Regulation: in the EU, the AI Act is making machine-readable opt-outs (robots.txt) legally meaningful for training; the UK dropped its opt-out plan in March 2026. See ai-crawler-regulation-eu-uk.
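
For the cost point above, “measure before acting” can start with a count of requests and bytes per AI bot in your access logs. The sketch below assumes logs in the common combined format and matches only the tokens named in this guide; AI_TOKENS, LOG_LINE, and summarize are hypothetical names, and the regex will need adjusting for other log layouts.

```python
import re
from collections import defaultdict

# Tokens to look for in the User-Agent field (assumed list; extend as needed).
AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot"]

# Combined log format tail: "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"\S+ \S+ \S+" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(log_lines):
    """Return {token: {"requests": n, "bytes": b}} for matching AI bots."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in ua:
                size = match.group("bytes")
                totals[token]["requests"] += 1
                totals[token]["bytes"] += 0 if size == "-" else int(size)
                break
    return dict(totals)
```

Since user-agent strings can be spoofed, treat these totals as an estimate of claimed bot traffic and pair them with IP verification before drawing conclusions.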

Do it right: the checklist

The eight-step version — audit, decide, write robots.txt, reconcile CDN/WAF, verify reachability, firewall the scrapers, optionally add llms.txt, monitor quarterly — is in ai-crawler-access-checklist. Work it top to bottom.


Key takeaways

  • Classify every bot as training (block to opt out), retrieval (allow), or user-fetch (never block).
  • The only real decision is training — and it hinges on content-as-product vs content-as-marketing.
  • robots.txt asks; a WAF enforces and overrides it — reconcile your layers and verify by IP range.
  • llms.txt is a comprehension aid, not access control; costs and EU regulation add further weight to a deliberate policy.
  • Use the checklist, then refresh quarterly — tokens and CDN defaults drift.
