AI Crawler Access Checklist: 8 Steps for Site Owners
A practical checklist for getting AI crawler access right: audit current access, decide a policy by category, write robots.txt, reconcile your CDN/WAF, verify search bots can reach you, stop non-compliant scrapers at the firewall, decide on llms.txt, and monitor with a quarterly refresh.
By Andrej Ruckij · June 17, 2026
TL;DR: Get AI crawler access right in eight steps: (1) audit what’s currently allowed/blocked, (2) decide your policy by category, (3) write robots.txt to block training and allow search, (4) reconcile your CDN/WAF so it doesn’t override you, (5) verify you’re reachable by search bots, (6) handle non-compliant scrapers at the firewall, (7) optionally add llms.txt, (8) monitor and refresh quarterly.
This checklist is part of the complete guide to AI crawler access control. It is the do-it list: work top to bottom.
1. Audit what’s happening now
Before changing anything, see the current state:
- Check your server logs / analytics for AI user-agents and their volume (how to read this).
- Check your CDN dashboard — many (e.g. Cloudflare) now show AI-bot traffic and may already be blocking by default (does-cloudflare-block-ai-crawlers).
- Note whether you’re currently in or out of AI answers (a quick test: ask ChatGPT/Perplexity about your brand).
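For the log check, a minimal sketch along these lines can give a first count of AI crawler hits. It assumes your access log is plain text with the user-agent string somewhere on each line; the sample log lines are made up for illustration, and the token list simply reuses the names that appear in this checklist.

```python
from collections import Counter

# Tokens from the robots.txt example in this checklist, plus Bytespider from step 6.
# Google-Extended is left out on the assumption that it is a robots.txt control
# token rather than a string that shows up in request logs.
AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "Bytespider"]


def count_ai_hits(log_lines):
    """Count log lines that mention each AI crawler token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each line to at most one token
    return counts


# Made-up lines in combined log format, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [17/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [17/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [17/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.8 - - [17/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_hits(SAMPLE_LOG).most_common():
        print(token, hits)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`. Substring matching is deliberately crude: it counts anything that claims to be a given bot, including spoofed traffic.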
2. Decide your policy by category
Resolve the one real question — is your content your product or your marketing? — then apply the framework (block-or-allow-ai-crawlers):
- Block training bots (opt out of free model training).
- Allow retrieval/search bots (they cite you).
- Never block user-fetch bots (they fetch a page because a person asked for it, so they are effectively visitors).
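One way to keep this decision explicit is to record it as data and generate the rules from it. The sketch below is an illustration, not a required tool: the `POLICY` structure and `render_robots` helper are hypothetical names, and the category assignments mirror the tokens used elsewhere in this checklist. User-fetch bots get no entry, since the policy is to leave them alone.

```python
# Hypothetical policy table: category -> rule and the tokens it applies to.
POLICY = {
    "training": {
        "rule": "Disallow: /",
        "tokens": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"],
    },
    "search": {
        "rule": "Allow: /",
        "tokens": ["OAI-SearchBot", "PerplexityBot"],
    },
}


def render_robots(policy):
    """Render one robots.txt group per token, separated by blank lines."""
    groups = []
    for entry in policy.values():
        for token in entry["tokens"]:
            groups.append(f"User-agent: {token}\n{entry['rule']}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots(POLICY), end="")
```

Keeping the categories in one place means a policy change (say, allowing a training bot) is a one-line edit rather than a hunt through robots.txt.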
3. Write the robots.txt
Use the verified tokens (directory). A typical marketing-site policy:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
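Before deploying, it can help to check that the file says what you intend. The sketch below feeds the policy above to Python's standard `urllib.robotparser` and reports which agents it would allow. Note the limits: this shows how one parser reads the rules, not how any particular crawler behaves, and `https://example.com/` is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# The policy from this step, with blank lines between groups for readability.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"]


def check_policy(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether the rules let that agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    for agent, allowed in check_policy(ROBOTS_TXT, AGENTS).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```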
4. Reconcile your CDN / WAF
This is the step most people skip. robots.txt only states a preference; a managed “block all AI” rule at your CDN refuses the request at the edge, so it wins over your robots.txt Allow (robots-txt-vs-waf-ai-bots). Open your CDN’s bot settings and make sure the search bots you allowed in robots.txt aren’t blocked there.
5. Verify you’re reachable by search bots
Confirm the bots you want can actually reach you. A UA-spoofing audit catches CDN/WAF hard-blocks that standard tools miss (tools/ai-visibility-audit). If OAI-SearchBot or PerplexityBot can’t fetch you, you’re invisible in AI search.
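A rough version of a UA-spoofing check can be sketched with the standard library: request the same URL under different User-Agent values and compare status codes. The bare tokens used as User-Agent strings here are a simplification (real crawlers send longer strings), and the function names are illustrative. Because the requests come from your own IP, this only exercises user-agent-based rules; a WAF that also verifies IP ranges may treat the spoofed request differently from the real bot.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses (e.g. a 403 from the WAF) are raised as HTTPError.
        return error.code
    # Network-level failures (DNS, timeouts) are not handled and will propagate.


def audit(url, agents, fetch=fetch_status):
    """Map each agent to the status code returned when fetching url as that agent."""
    return {agent: fetch(url, agent) for agent in agents}


def blocked(results):
    """List the agents that did not get a 200 response."""
    return sorted(agent for agent, status in results.items() if status != 200)
```

Typical use would be `blocked(audit("https://example.com/", ["OAI-SearchBot", "PerplexityBot", "Mozilla/5.0"]))`, with your own URL in place of the placeholder. If a search bot's token appears in the result while the browser string does not, the edge is the likely culprit.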
6. Handle non-compliant scrapers at the firewall
robots.txt won’t stop scrapers that ignore it (can-robots-txt-stop-ai-scrapers). For aggressive or stealth crawlers (the Bytespider type), add firewall + IP/ASN rules, and verify legitimate bots by their published IP ranges, not by user-agent name, which anyone can spoof.
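The IP-range check itself is simple to sketch with Python's `ipaddress` module. The ranges below are placeholders taken from the reserved documentation blocks, not real crawler ranges; in practice you would load the lists each vendor publishes. `PUBLISHED_RANGES` and `is_verified` are illustrative names.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), not real crawler ranges.
# Replace with the lists each vendor publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified(claimed_bot, remote_ip, ranges=PUBLISHED_RANGES):
    """True if remote_ip falls inside a range published for the claimed bot."""
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ranges.get(claimed_bot, []))


if __name__ == "__main__":
    # A request claiming to be OAI-SearchBot from inside and outside its range.
    print(is_verified("OAI-SearchBot", "192.0.2.10"))
    print(is_verified("OAI-SearchBot", "203.0.113.7"))
```

A request whose user-agent names a known bot but whose IP fails this check is a candidate for a firewall block rather than a robots.txt rule.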
7. Optionally add llms.txt
If you want, add a curated llms.txt to help AI understand your site (how-to). Keep expectations low — it’s advisory and inconsistently adopted (does-llms-txt-work). It’s a “why not,” not a priority.
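If you do add one, the file is short enough to generate. The sketch below assumes the layout from the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of links); the site name, URLs, and the `render_llms_txt` helper are hypothetical.

```python
def render_llms_txt(title, summary, sections):
    """Render a minimal llms.txt: title, summary, then sections of annotated links."""
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{name}]({url}): {note}" for name, url, note in links]
    return "\n".join(lines) + "\n"


# Hypothetical site content, for illustration only.
SECTIONS = {
    "Guides": [
        ("AI crawler checklist", "https://example.com/ai-crawler-checklist", "Eight-step access checklist"),
    ],
    "Tools": [
        ("Visibility audit", "https://example.com/tools/audit", "Checks whether bots can reach the site"),
    ],
}

if __name__ == "__main__":
    print(render_llms_txt("Example Co", "Guides and tools for managing AI crawler access.", SECTIONS), end="")
```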
8. Monitor and refresh quarterly
AI crawler tokens, IP ranges, and CDN defaults change often. Set a quarterly review: re-check the directory for new/renamed bots, confirm your blocks still match current tokens, and re-verify reachability. A stale block list is a silent liability.
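The token part of that review comes down to a set comparison. In this sketch, `configured` stands for the tokens in your robots.txt and `directory` for the current list of known tokens; `OldExampleBot` and `NewExampleBot` are invented names used only to show the two kinds of drift.

```python
def review_block_list(configured, directory):
    """Compare configured tokens against the current directory of known tokens."""
    configured, directory = set(configured), set(directory)
    return {
        # In your rules but no longer in the directory: possibly renamed or retired.
        "stale": sorted(configured - directory),
        # In the directory but not in your rules: no policy decided yet.
        "missing": sorted(directory - configured),
    }


if __name__ == "__main__":
    configured = ["GPTBot", "ClaudeBot", "OldExampleBot"]  # OldExampleBot is invented
    directory = ["GPTBot", "ClaudeBot", "NewExampleBot"]   # NewExampleBot is invented
    print(review_block_list(configured, directory))
```

Anything under "stale" may be a rule that no longer matches a live bot; anything under "missing" is a bot you have not yet made a decision about.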
Key takeaways
- Audit first, then set a category-based policy in robots.txt.
- Reconcile your CDN/WAF — it overrides robots.txt — and verify search bots can actually reach you.
- Enforce against non-compliant scrapers at the firewall, verifying by IP range.
- llms.txt is optional; monitoring and a quarterly refresh are not.
Related articles
- ai-crawler-access-control-guide — the parent guide
- block-or-allow-ai-crawlers — the policy decision (step 2)
- ai-crawler-user-agents-directory — the tokens for step 3
- robots-txt-vs-waf-ai-bots — steps 4 and 6
- tools/ai-visibility-audit — step 5 verification
Sources
- seo/ai-crawler-access — internal synthesis
- OpenAI — Bots / Crawlers documentation