AI Crawler Access Checklist: 8 Steps for Site Owners

By Andrej Ruckij · June 17, 2026

TL;DR: Get AI crawler access right in eight steps: (1) audit what’s currently allowed/blocked, (2) decide your policy by category, (3) write robots.txt to block training and allow search, (4) reconcile your CDN/WAF so it doesn’t override you, (5) verify you’re reachable by search bots, (6) handle non-compliant scrapers at the firewall, (7) optionally add llms.txt, (8) monitor and refresh quarterly.

This checklist is part of the complete guide to AI crawler access control. It is the do-it list: work top to bottom.

1. Audit what’s happening now

Before changing anything, see the current state:

Check your server logs / analytics for AI user-agents and their volume (how to read this).
Check your CDN dashboard — many (e.g. Cloudflare) now show AI-bot traffic and may already be blocking by default (does-cloudflare-block-ai-crawlers).
Note whether you’re currently in or out of AI answers (a quick test: ask ChatGPT/Perplexity about your brand).
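If you have raw access logs, a few lines of scripting give a first count. The sketch below assumes the common "combined" log format, where the user-agent is the last quoted field. The token list and the sample lines are illustrative, not taken from a real site. Google-Extended is left out on purpose: it is a robots.txt control token and does not show up as a user-agent in logs.

```python
import re
from collections import Counter

# Tokens from the robots.txt policy in this checklist, plus Bytespider (step 6).
# Extend this list from your own directory of verified tokens.
AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "Bytespider"]

# Assumption: combined log format, so the user-agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(lines):
    """Return a Counter mapping each AI token to its number of log lines."""
    hits = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Made-up sample lines using documentation IP addresses.
SAMPLE = [
    '203.0.113.5 - - [17/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [17/Jun/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.7 - - [17/Jun/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0"',
    '203.0.113.5 - - [17/Jun/2026:10:00:03 +0000] "GET /c HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for token, count in count_ai_hits(SAMPLE).most_common():
    print(token, count)
```

To run it on a real file, pass an open file object instead of SAMPLE, for example count_ai_hits(open("access.log")). User-agent strings can be spoofed, so treat the counts as a rough picture rather than proof of who is crawling.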

2. Decide your policy by category

Resolve the one real question — is your content your product or your marketing? — then apply the framework (block-or-allow-ai-crawlers):

Block training bots (opt out of free model training).
Allow retrieval/search bots (they cite you).
Never block user-fetch bots (they’re visitors).
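One way to keep this decision maintainable is to record it as data and generate the robots.txt rules from it. The sketch below is an illustration, not a required tool: POLICY and render_robots are hypothetical names, and the tokens are the ones used in this checklist. User-fetch bots are deliberately left out, since an agent with no matching group is not restricted.

```python
# Category -> (rule, tokens). User-fetch bots are intentionally not listed:
# with no matching group, robots.txt places no restriction on them.
POLICY = {
    "training": ("Disallow: /", ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]),
    "search": ("Allow: /", ["OAI-SearchBot", "PerplexityBot"]),
}


def render_robots(policy):
    """Render one robots.txt group per token, with a comment per category."""
    blocks = []
    for category, (rule, tokens) in policy.items():
        groups = [f"User-agent: {token}\n{rule}" for token in tokens]
        blocks.append(f"# {category}\n" + "\n\n".join(groups))
    return "\n\n".join(blocks) + "\n"


print(render_robots(POLICY))
```

When a bot is renamed or a new one appears, you change one list and regenerate, instead of hand-editing the file.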

3. Write the robots.txt

Use the verified tokens (directory). A typical marketing-site policy:

# Training crawlers: opt out
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Search/retrieval crawlers: allow
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
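Before deploying, it helps to check that the file says what you think it says. A minimal sketch using Python's standard urllib.robotparser: the example.com URL is a placeholder, and ChatGPT-User stands in for a user-fetch bot that the policy does not list. Python's parser is only one interpretation of the rules, so individual crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def allowed(robots_txt, agent, url="https://example.com/"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ["GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]:
    print(agent, "allowed" if allowed(ROBOTS_TXT, agent) else "blocked")
```

The training bots should come back blocked, and the search bots and the unlisted ChatGPT-User allowed. This only checks the file itself and says nothing about what your CDN does.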

4. Reconcile your CDN / WAF

This is the step most people skip. robots.txt only tells a crawler what it may request; a managed “block all AI” rule at your CDN rejects the request itself, so it overrides your robots.txt Allow (robots-txt-vs-waf-ai-bots). Open your CDN’s bot settings and make sure the search bots you allowed in robots.txt aren’t blocked at the edge.

5. Verify you’re reachable by search bots

Confirm the bots you want can actually reach you. A UA-spoofing audit catches CDN/WAF hard-blocks that standard tools miss (tools/ai-visibility-audit). If OAI-SearchBot or PerplexityBot can’t fetch your pages, ChatGPT search and Perplexity have nothing to cite.
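A rough version of a UA-spoofing audit can be scripted with the standard library. The idea: request the same URL with a browser user-agent and with each bot's token, then flag any probe whose status differs from the browser baseline. The user-agent strings below are simplified stand-ins, not the exact strings the vendors send, and fetch_status, audit, and find_blocked are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Simplified, illustrative user-agent strings. Real crawlers send longer ones;
# check each vendor's documentation for the exact values.
PROBES = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "OAI-SearchBot": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code seen when fetching url with this user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 429, 503, ... are exactly what we want to see


def audit(url):
    """Probe url once per user-agent and return {probe name: status}."""
    return {name: fetch_status(url, ua) for name, ua in PROBES.items()}


def find_blocked(statuses, baseline="browser"):
    """Return probe names whose status differs from the baseline probe."""
    expected = statuses[baseline]
    return [name for name, status in statuses.items()
            if name != baseline and status != expected]


# Offline example with made-up statuses:
print(find_blocked({"browser": 200, "OAI-SearchBot": 403, "PerplexityBot": 200}))
```

Against a live site you would call find_blocked(audit("https://example.com/")) with your own URL. A spoofed request comes from your IP, not the bot's, so this only surfaces user-agent-based rules: an edge rule that verifies IP ranges may block your probe while still letting the real crawler through.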

6. Handle non-compliant scrapers at the firewall

robots.txt won’t stop scrapers that ignore it (can-robots-txt-stop-ai-scrapers). For aggressive or stealth crawlers (the Bytespider type), add firewall + IP/ASN rules, and verify legitimate bots by published IP range — not user-agent name.
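Verifying by IP range comes down to a CIDR membership test. In the sketch below the ranges are placeholders from the reserved documentation blocks, not any vendor's real addresses: in practice you would load the ranges each vendor publishes and refresh them regularly. PUBLISHED_RANGES and is_verified are hypothetical names.

```python
import ipaddress

# HYPOTHETICAL ranges from reserved documentation blocks (RFC 5737).
# Replace with the ranges each vendor actually publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/25"],
}


def is_verified(token, remote_ip, ranges=PUBLISHED_RANGES):
    """Return True if remote_ip falls inside a published range for token."""
    address = ipaddress.ip_address(remote_ip)
    return any(address in ipaddress.ip_network(cidr)
               for cidr in ranges.get(token, []))


print(is_verified("OAI-SearchBot", "192.0.2.44"))    # True: inside 192.0.2.0/24
print(is_verified("OAI-SearchBot", "203.0.113.9"))   # False: claims the name, wrong network
```

A request that claims a bot's name but fails this check is a candidate for a firewall rule. Most CDNs and WAFs can do the same match natively, which is usually the better place to enforce it.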

7. Optionally add llms.txt

If you want, add a curated llms.txt that points AI tools at your most important pages (how-to). Keep expectations low: it’s advisory and inconsistently adopted (does-llms-txt-work). It’s a “why not,” not a priority.
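For a sense of what the file looks like, here is a small generator. It follows the layout described in the public llms.txt proposal as I understand it: a title, a one-line summary as a blockquote, then sections of annotated links. The site name, URLs, and the render_llms_txt helper are all hypothetical.

```python
def render_llms_txt(title, summary, sections):
    """Render an llms.txt-style Markdown file.

    sections maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{name}]({url}): {note}" for name, url, note in links]
    return "\n".join(lines) + "\n"


# Hypothetical site content, for illustration only.
print(render_llms_txt(
    "Example Co",
    "Example Co makes widgets; these are the pages that explain them best.",
    {
        "Docs": [("Getting started", "https://example.com/docs/start", "setup in five minutes")],
        "Guides": [("Pricing", "https://example.com/pricing", "plans and limits")],
    },
))
```

The output is served as plain text at /llms.txt. Keep it short and curated: a full sitemap defeats the purpose.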

8. Monitor and refresh quarterly

AI crawler tokens, IP ranges, and CDN defaults change frequently. Set a quarterly review: re-check the directory for new/renamed bots, confirm your blocks still match current tokens, and re-verify reachability. A stale block list is a silent liability.
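Part of the review can be automated by diffing the tokens in your robots.txt against your current token directory. The sketch below assumes you keep that directory as a simple set of names. OldBot-Example and NewBot-Example are made-up tokens, and tokens_in and review are hypothetical helpers.

```python
import re

# Capture the token after "User-agent:", ignoring case and trailing comments.
UA_LINE = re.compile(r"^\s*user-agent\s*:\s*([^\s#]+)", re.IGNORECASE | re.MULTILINE)


def tokens_in(robots_txt):
    """Return the lowercased set of user-agent tokens named in robots.txt."""
    return {token.lower() for token in UA_LINE.findall(robots_txt)}


def review(robots_txt, directory):
    """Compare robots.txt tokens with the current directory of known tokens."""
    listed = tokens_in(robots_txt) - {"*"}
    current = {token.lower() for token in directory}
    return {
        "stale": sorted(listed - current),    # in robots.txt, no longer in the directory
        "missing": sorted(current - listed),  # in the directory, no rule yet
    }


# Made-up inputs: the "-Example" tokens are hypothetical.
DIRECTORY = {"GPTBot", "ClaudeBot", "NewBot-Example"}
ROBOTS = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: OldBot-Example\nDisallow: /\n"

print(review(ROBOTS, DIRECTORY))
```

A "stale" entry suggests a renamed or retired bot whose rule no longer does anything. A "missing" entry is only a prompt to decide, since not every known bot needs a rule.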

Key takeaways

Audit first, then set a category-based policy in robots.txt.
Reconcile your CDN/WAF — it overrides robots.txt — and verify search bots can actually reach you.
Enforce against non-compliant scrapers at the firewall, verifying by IP range.
llms.txt is optional; monitoring and a quarterly refresh are not.

ai-crawler-access-control-guide — the parent guide
block-or-allow-ai-crawlers — the policy decision (step 2)
ai-crawler-user-agents-directory — the tokens for step 3
robots-txt-vs-waf-ai-bots — steps 4 and 6
tools/ai-visibility-audit — step 5 verification

Sources

seo/ai-crawler-access — internal synthesis
OpenAI — Bots / Crawlers documentation