The AI Crawler Directory: Every User-Agent, What It Does, Allow or Block (2026)
A complete reference table of AI crawler user-agents — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider and more — with each bot's job, robots.txt compliance, IP-range file, and an allow/block recommendation.
By Andrej Ruckij · June 17, 2026
TL;DR: This is the master reference for AI crawler user-agents as of mid-2026. The rule that organizes all of it: classify each bot as training (block to opt out — no traffic back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). Below: every major bot, its exact user-agent token, whether it respects robots.txt, its IP-range file for verification, and a recommendation. Treat the tokens as a dated layer — they drift; re-check quarterly.
What you’ll learn
- The exact user-agent token for every major AI crawler
- Which category each falls into (training / retrieval / user-fetch / opt-out token)
- Whether each respects robots.txt and where to verify it by IP range
- A clear allow/block recommendation per bot
- Which “bots” are actually opt-out tokens, not crawlers
How to read this directory
Three categories drive every decision (full taxonomy: glossary/ai-crawler):
- Training — turns your content into model weights; no citation, no traffic. Block to opt out.
- Retrieval / search — indexes you to cite in AI answers, with a link. Allow.
- User-fetch — fires when a real person asks an AI to open your page. Never block — it’s a visitor.
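The three-way split above can be sketched as a small lookup. This is a minimal illustration, not a vendor-endorsed method: the token lists are copied from this directory's tables, the function name `classify` is hypothetical, and the sample user-agent strings in the usage note are illustrative rather than exact vendor strings. Opt-out tokens are left out because they never appear in a request header.

```python
# Category map copied from this directory's tables (a mid-2026 snapshot).
# Opt-out tokens (Google-Extended, Applebot-Extended) are omitted on purpose:
# they are robots.txt signals and never show up as a request user-agent.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent", "Amazonbot"],
    "retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                  "Google-CloudVertexBot"],
    "user-fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User",
                   "Meta-ExternalFetcher"],
}


def classify(user_agent: str) -> str:
    """Return the category whose token appears in the user-agent header.

    Matching is a case-insensitive substring test. The header is spoofable,
    so treat the result as a hint, not as proof of identity.
    """
    ua = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua:
                return category
    return "unknown"


if __name__ == "__main__":
    # Illustrative header values, not exact vendor strings.
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)"))
    print(classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Run as a script, this prints `training` and then `user-fetch`. A header that matches none of the listed tokens comes back as `unknown`, which is a reasonable place to route traffic for manual review.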
Two cautions before the table: user-agent strings are spoofable, so verify anything you care about against the operator’s published IP-range file; and robots.txt only governs compliant bots — non-compliant scrapers need a firewall (robots-txt-vs-waf-ai-bots).
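The IP-range check can be sketched with Python's standard `ipaddress` module. The JSON layout is an assumption: this sketch expects a top-level `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix` keys, so confirm the actual shape of each operator's file before relying on it. The function names are hypothetical, and the ranges in the usage note are reserved documentation addresses, not real crawler ranges.

```python
import ipaddress


def extract_cidrs(range_file: dict) -> list[str]:
    """Pull CIDR strings out of an already-parsed IP-range JSON document.

    ASSUMPTION: the file looks like
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}.
    Adjust the keys if an operator publishes a different layout.
    """
    cidrs = []
    for entry in range_file.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                cidrs.append(entry[key])
    return cidrs


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if the client IP falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False


if __name__ == "__main__":
    # Placeholder data using reserved documentation ranges.
    sample = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
                           {"ipv6Prefix": "2001:db8::/32"}]}
    cidrs = extract_cidrs(sample)
    print(ip_in_ranges("192.0.2.17", cidrs))    # inside the sample IPv4 block
    print(ip_in_ranges("198.51.100.5", cidrs))  # outside every sample block
```

With the placeholder data, the first call prints `True` and the second prints `False`. In practice the range file would be fetched and cached on a schedule, and the check applied only to requests whose user-agent claims to be one of the bots in this directory.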
Training crawlers (block to opt out of training)
| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Yes | openai.com/gptbot.json | GPTBot/1.3. See openai-crawlers |
| ClaudeBot | Anthropic | Yes | claude.com/crawling/bots.json | Anthropic now publishes ranges (2026). See anthropic-crawlers |
| CCBot | Common Crawl | Yes | not listed | Open dataset feeding many models. See glossary/ccbot |
| Meta-ExternalAgent | Meta | Yes | not listed | meta-externalagent/1.1; Meta AI / Llama. See meta-amazon-ai-crawlers |
| Amazonbot | Amazon | Yes | not listed | May feed AI; mind the Amzn-SearchBot sibling |
| Google-Extended | Google | Yes (token) | not listed | Opt-out token, not a crawler (see below). See google-ai-crawlers |
| Applebot-Extended | Apple | Yes (token) | not listed | Opt-out token read by Applebot for Apple Intelligence training |
Retrieval / search crawlers (allow — they cite you)
| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
|---|---|---|---|---|
| OAI-SearchBot | OpenAI | Yes | openai.com/searchbot.json | OAI-SearchBot/1.3. Allow to stay in ChatGPT |
| Claude-SearchBot | Anthropic | Yes | claude.com/crawling/bots.json | Allow for Claude citations |
| PerplexityBot | Perplexity | Yes | perplexity.com/perplexitybot.json | See the Aug-2025 caveat: perplexity-crawlers |
| Google-CloudVertexBot | Google | Yes | not listed | Real crawler (unlike Google-Extended); Cloud/Vertex |
User-fetch agents (never block — they’re visitors)
| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
|---|---|---|---|---|
| ChatGPT-User | OpenAI | Yes | openai.com/chatgpt-user.json | Fires on a real user’s request |
| Claude-User | Anthropic | Yes | claude.com/crawling/bots.json | Fires on a real user’s request |
| Perplexity-User | Perplexity | No (by design) | perplexity.com/perplexity-user.json | User-triggered; ignores robots.txt |
| Meta-ExternalFetcher | Meta | Yes | not listed | Real-time when a user asks Meta AI |
The “bots” that aren’t crawlers
Two of the most-discussed entries make no requests at all:
- Google-Extended: a robots.txt opt-out token. It tells Google not to use already-crawled content for Gemini/Vertex training. Disallowing it does not stop any crawling and does not affect Google Search ranking. (google-ai-crawlers)
- Applebot-Extended: the same idea for Apple Intelligence; a signal read by the existing Applebot.
Disallowing these changes data use, not access. Conflating opt-out tokens with crawlers is the most common AI-access mistake.
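The difference between a token and a crawler can be seen in how robots.txt groups are matched. The sketch below uses Python's standard `urllib.robotparser` to show that a `Google-Extended` group applies only to that name and leaves `Googlebot` untouched. It illustrates generic rule matching only; how Google or Apple act on the token internally is not something this code can demonstrate. The helper name `can_fetch` and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of the Google-Extended token and nothing else.
RULES = """\
User-agent: Google-Extended
Disallow: /
"""


def can_fetch(rules: str, agent: str, url: str) -> bool:
    """Evaluate robots.txt text for one agent name with the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # hypothetical page
    print(can_fetch(RULES, "Google-Extended", url))  # the opt-out group applies
    print(can_fetch(RULES, "Googlebot", url))        # no group names Googlebot
```

This prints `False` for `Google-Extended` and `True` for `Googlebot`: the search crawler's access is unchanged, which is why the opt-out has no bearing on crawl access.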
Other bots worth knowing
- OAI-AdsBot (OpenAI): validates ad landing pages. New in 2026.
- Amzn-SearchBot (Amazon): a search-only sibling of Amazonbot; verify which you’re seeing. (meta-amazon-ai-crawlers)
- Bytespider (ByteDance): the canonical non-compliant scraper; widely reported to ignore robots.txt. Firewall-only enforcement. (glossary/bytespider)
- Agentic-commerce/shopping bots: no stable, widely documented robots.txt token as of mid-2026; they run via browser-agent fetchers. Watch this space.
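For a bot that ignores robots.txt, the block has to happen where the request is served. Real enforcement usually lives in a WAF or CDN rule; the sketch below is only an application-layer stand-in, written as a generic WSGI middleware, to show the shape of the logic. The names `block_scrapers` and `BLOCKED_TOKENS` are hypothetical. Because it matches on the user-agent header, it catches only scrapers that identify themselves.

```python
# Tokens to refuse outright. Bytespider is the example named in this directory.
BLOCKED_TOKENS = ("bytespider",)


def block_scrapers(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the user-agent names a blocked bot."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

A site would wrap its existing WSGI callable, for example `application = block_scrapers(application)`. Pairing this with an IP-level rule is the more robust option, since a scraper can change its header at will.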
A recommended starting robots.txt
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Allow retrieval/search (these cite you)
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-fetch agents are allowed by default (you wouldn’t list them as Disallow). For the reasoning behind this policy, see which-ai-bots-to-block; for enforcing it against bots that ignore robots.txt, see robots-txt-vs-waf-ai-bots.
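Since the tokens drift, it can help to generate the file from a small policy map and then check the result with a parser before deploying. The sketch below is one way to do that with Python's standard `urllib.robotparser`. The policy is copied from the starting robots.txt above; the names `POLICY`, `render_robots`, and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Policy copied from the starting robots.txt above; edit to match your own.
POLICY = {
    "Disallow": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
                 "Meta-ExternalAgent"],
    "Allow": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
}


def render_robots(policy: dict[str, list[str]]) -> str:
    """Render one User-agent group per token, separated by blank lines."""
    groups = []
    for directive, tokens in policy.items():
        for token in tokens:
            groups.append(f"User-agent: {token}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Check a generated file the way a compliant parser would read it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    text = render_robots(POLICY)
    print(text)
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        print(agent, is_allowed(text, agent))
```

With this policy the check reports `GPTBot False`, `OAI-SearchBot True`, and `ChatGPT-User True`. The last case confirms that an agent with no group of its own stays allowed, which matches the note about user-fetch agents.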
Common questions
- Q: Will blocking GPTBot hurt my ChatGPT visibility? A: No. ChatGPT visibility depends on OAI-SearchBot, a separate bot you can allow. See gptbot-vs-oai-searchbot.
- Q: Does blocking Google-Extended hurt my Google ranking? A: No. It’s an opt-out token, not Googlebot. See google-ai-crawlers.
- Q: Is Cloudflare already blocking these for me? A: Possibly. Cloudflare has blocked AI crawlers by default for newly added domains since July 2025, so check your zone’s settings. See does-cloudflare-block-ai-crawlers.
Maintenance note
User-agent tokens, version numbers, and IP-range publication change frequently — this directory is a mid-2026 snapshot. Re-verify against each vendor’s official docs before deploying, and refresh quarterly. A stale block list is a silent liability.
Key takeaways
- Classify every bot as training (block), retrieval (allow), or user-fetch (never block).
- Verify by published IP range, not the user-agent string (it’s spoofable).
- Google-Extended and Applebot-Extended are opt-out tokens, not crawlers.
- robots.txt governs compliant bots only; non-compliant scrapers (Bytespider) need a firewall.
- Treat the tokens as a dated layer — re-check quarterly.
Related articles
- openai-crawlers · anthropic-crawlers · google-ai-crawlers · perplexity-crawlers · meta-amazon-ai-crawlers — per-vendor deep dives
- which-ai-bots-to-block — the policy behind the table
- robots-txt-vs-waf-ai-bots — how to enforce it
- glossary/ai-crawler · glossary/gptbot · glossary/ccbot · glossary/bytespider — definitions
- seo/ai-crawler-access — the wiki reference page
Sources
- OpenAI — Bots / Crawlers documentation
- Anthropic — Does Anthropic crawl the web, and how to block it
- Perplexity — Crawlers documentation
- seo/ai-crawler-access — internal synthesis with the full taxonomy