The AI Crawler Directory: Every User-Agent, What It Does, Allow or Block (2026)

By Andrej Ruckij · June 17, 2026

TL;DR: This is the master reference for AI crawler user-agents as of mid-2026. The rule that organizes all of it: classify each bot as training (block to opt out — no traffic back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). Below: every major bot, its exact user-agent token, whether it respects robots.txt, its IP-range file for verification, and a recommendation. Treat the tokens as a dated layer — they drift; re-check quarterly.

What you’ll learn

The exact user-agent token for every major AI crawler
Which category each falls into (training / retrieval / user-fetch / opt-out token)
Whether each respects robots.txt and where to verify it by IP range
A clear allow/block recommendation per bot
Which “bots” are actually opt-out tokens, not crawlers

How to read this directory

Three categories drive every decision (full taxonomy: glossary/ai-crawler):

Training — turns your content into model weights; no citation, no traffic. Block to opt out.
Retrieval / search — indexes you to cite in AI answers, with a link. Allow.
User-fetch — fires when a real person asks an AI to open your page. Never block — it’s a visitor.

Two cautions before the table: user-agent strings are spoofable, so verify anything you care about against the operator’s published IP-range file; and robots.txt only governs compliant bots — non-compliant scrapers need a firewall (robots-txt-vs-waf-ai-bots).
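The IP check in the first caution can be sketched in Python with the standard-library ipaddress module. This is a minimal sketch under stated assumptions, not a vendor-sanctioned implementation: it assumes the range file is JSON shaped like {"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]}, which should be confirmed against the operator's actual file. The sample prefixes are reserved documentation ranges, not real crawler addresses. load_prefixes and ip_in_ranges are hypothetical helper names.

```python
import ipaddress
import json

# Placeholder payload. The "prefixes"/"ipv4Prefix"/"ipv6Prefix" shape is an
# assumption; the addresses are reserved documentation ranges, not real ones.
SAMPLE_RANGE_FILE = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv4Prefix": "198.51.100.0/28"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_prefixes(raw_json):
    """Parse a range file (assumed shape above) into ip_network objects."""
    networks = []
    for entry in json.loads(raw_json).get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(client_ip, networks):
    """Return True if client_ip falls inside any of the given networks."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address in the log: treat as unverified
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so a mixed list of networks is safe to scan.
    return any(address in network for network in networks)


networks = load_prefixes(SAMPLE_RANGE_FILE)
print(ip_in_ranges("192.0.2.44", networks))   # inside the first placeholder range
print(ip_in_ranges("203.0.113.9", networks))  # outside every placeholder range
```

In practice the payload would be fetched from the operator's published file and cached, and a request would count as verified only when both the user-agent token and the source IP match.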

Training crawlers (block to opt out of training)

User-agent	Vendor	Respects robots.txt	IP ranges	Notes
`GPTBot`	OpenAI	Yes	`openai.com/gptbot.json`	`GPTBot/1.3`. See openai-crawlers
`ClaudeBot`	Anthropic	Yes	`claude.com/crawling/bots.json`	Anthropic now publishes ranges (2026). anthropic-crawlers
`CCBot`	Common Crawl	Yes	—	Open dataset feeding many models. glossary/ccbot
`Meta-ExternalAgent`	Meta	Yes	—	`meta-externalagent/1.1`; Meta AI / Llama. meta-amazon-ai-crawlers
`Amazonbot`	Amazon	Yes	—	May feed AI; mind the `Amzn-SearchBot` sibling
`Google-Extended`	Google	Yes (token)	—	Opt-out token, not a crawler — see below. google-ai-crawlers
`Applebot-Extended`	Apple	Yes (token)	—	Opt-out token read by `Applebot` for Apple Intelligence training

Retrieval / search crawlers (allow — they cite you)

User-agent	Vendor	Respects robots.txt	IP ranges	Notes
`OAI-SearchBot`	OpenAI	Yes	`openai.com/searchbot.json`	`OAI-SearchBot/1.3`. Allow to stay in ChatGPT
`Claude-SearchBot`	Anthropic	Yes	`claude.com/crawling/bots.json`	Allow for Claude citations
`PerplexityBot`	Perplexity	Yes	`perplexity.com/perplexitybot.json`	Declared as compliant, but Cloudflare reported undeclared crawling in Aug 2025; see perplexity-crawlers
`Google-CloudVertexBot`	Google	Yes	—	Real crawler (unlike Google-Extended); Cloud/Vertex

User-fetch agents (never block — they’re visitors)

User-agent	Vendor	Respects robots.txt	IP ranges	Notes
`ChatGPT-User`	OpenAI	Yes	`openai.com/chatgpt-user.json`	Fires on a real user’s request
`Claude-User`	Anthropic	Yes	`claude.com/crawling/bots.json`	Fires on a real user’s request
`Perplexity-User`	Perplexity	No (by design)	`perplexity.com/perplexity-user.json`	User-triggered; ignores robots.txt
`Meta-ExternalFetcher`	Meta	Yes	—	Real-time when a user asks Meta AI
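
For log analysis, the three tables above can be folded into a small lookup. The sketch below is illustrative only: the token list is copied from this directory's mid-2026 snapshot, classify_user_agent is a hypothetical helper, and the sample header is an example string rather than a guaranteed current value. It matches on substrings of the user-agent header, which is spoofable, so treat the result as a first-pass label and not as proof of identity.

```python
# Token -> (category, recommendation), copied from the tables above.
# Opt-out tokens (Google-Extended, Applebot-Extended) are left out because
# they are robots.txt signals, not crawlers that send requests.
DIRECTORY = {
    "GPTBot": ("training", "block"),
    "ClaudeBot": ("training", "block"),
    "CCBot": ("training", "block"),
    "Meta-ExternalAgent": ("training", "block"),
    "Amazonbot": ("training", "block"),
    "OAI-SearchBot": ("retrieval", "allow"),
    "Claude-SearchBot": ("retrieval", "allow"),
    "PerplexityBot": ("retrieval", "allow"),
    "Google-CloudVertexBot": ("retrieval", "allow"),
    "ChatGPT-User": ("user-fetch", "never block"),
    "Claude-User": ("user-fetch", "never block"),
    "Perplexity-User": ("user-fetch", "never block"),
    "Meta-ExternalFetcher": ("user-fetch", "never block"),
}


def classify_user_agent(header):
    """Return (token, category, recommendation), or None if nothing matches."""
    lowered = header.lower()
    # Try longer tokens first so a short token never shadows a longer one.
    for token in sorted(DIRECTORY, key=len, reverse=True):
        if token.lower() in lowered:
            category, recommendation = DIRECTORY[token]
            return token, category, recommendation
    return None


# Example header for illustration; real headers vary by version.
SAMPLE_HEADER = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.3; +https://openai.com/gptbot)"
)
print(classify_user_agent(SAMPLE_HEADER))  # ('GPTBot', 'training', 'block')
```

A header that matches no token returns None, which usually means an ordinary browser or a bot this directory does not cover.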

The “bots” that aren’t crawlers

Two of the most-discussed entries make no requests at all:

Google-Extended: a robots.txt opt-out token. It tells Google not to use already-crawled content for Gemini/Vertex training. Disallowing it does not stop any crawling and does not affect Google Search ranking. (google-ai-crawlers)
Applebot-Extended: the same idea for Apple Intelligence training; a signal read by the existing Applebot crawler, which keeps crawling either way.

Disallowing these changes data use, not access. Conflating opt-out tokens with crawlers is the most common AI-access mistake.

Other bots worth knowing

OAI-AdsBot (OpenAI) — validates ad landing pages. New in 2026.
Amzn-SearchBot (Amazon) — a search-only sibling of Amazonbot; verify which you’re seeing. (meta-amazon-ai-crawlers)
Bytespider (ByteDance) — the canonical non-compliant scraper; widely reported to ignore robots.txt. Firewall-only enforcement. (glossary/bytespider)
Agentic-commerce/shopping bots — no stable, widely-documented robots.txt token as of mid-2026; they run via browser-agent fetchers. Watch this space.

A recommended starting robots.txt

# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /

# Allow retrieval/search (these cite you)
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /

User-fetch agents need no entry: this file has no catch-all User-agent: * group, so any agent that matches no group is allowed by default. Just keep them out of the Disallow list. For the reasoning behind this policy, see which-ai-bots-to-block; for enforcing it against bots that ignore robots.txt, see robots-txt-vs-waf-ai-bots.
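
One way to sanity-check this policy before deploying is to parse it with Python's standard-library urllib.robotparser and ask what each agent may fetch. This is a sketch under assumptions: https://example.com/article is a placeholder URL, allowed is a hypothetical helper, and the standard-library parser only approximates how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The starting policy from this section, embedded as a string for testing.
ROBOTS_TXT = """\
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /

# Allow retrieval/search (these cite you)
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="https://example.com/article"):
    """Return True if the parsed policy lets `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"):
    verdict = "allowed" if allowed(agent) else "blocked"
    print(f"{agent:<16} {verdict}")
```

An agent with no matching group, such as ChatGPT-User, comes back as allowed, because the file has no catch-all group that would say otherwise.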

Common questions

Q: Will blocking GPTBot hurt my ChatGPT visibility? A: No. ChatGPT search visibility depends on OAI-SearchBot, a separate bot you can allow independently. See gptbot-vs-oai-searchbot.
Q: Does blocking Google-Extended hurt my Google ranking? A: No. It is an opt-out token, not Googlebot. See google-ai-crawlers.
Q: Is Cloudflare already blocking these for me? A: Possibly. Since July 2025, Cloudflare has blocked AI crawlers by default on new domains. See does-cloudflare-block-ai-crawlers.

Maintenance note

User-agent tokens, version numbers, and IP-range publication change frequently — this directory is a mid-2026 snapshot. Re-verify against each vendor’s official docs before deploying, and refresh quarterly. A stale block list is a silent liability.

Key takeaways

Classify every bot as training (block), retrieval (allow), or user-fetch (never block).
Verify by published IP range, not the user-agent string (it’s spoofable).
Google-Extended and Applebot-Extended are opt-out tokens, not crawlers.
robots.txt governs compliant bots only; non-compliant scrapers (Bytespider) need a firewall.
Treat the tokens as a dated layer — re-check quarterly.

Related

openai-crawlers · anthropic-crawlers · google-ai-crawlers · perplexity-crawlers · meta-amazon-ai-crawlers: per-vendor deep dives
which-ai-bots-to-block: the policy behind the table
robots-txt-vs-waf-ai-bots: how to enforce it
glossary/ai-crawler · glossary/gptbot · glossary/ccbot · glossary/bytespider: definitions
seo/ai-crawler-access: the wiki reference page

Sources

OpenAI — Bots / Crawlers documentation
Anthropic — Does Anthropic crawl the web, and how to block it
Perplexity — Crawlers documentation
seo/ai-crawler-access — internal synthesis with the full taxonomy