The AI Crawler Directory: Every User-Agent, What It Does, Allow or Block (2026)

A complete reference table of AI crawler user-agents — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider and more — with each bot's job, robots.txt compliance, IP-range file, and an allow/block recommendation.

By Andrej Ruckij · June 17, 2026 · 6 min read

TL;DR: This is the master reference for AI crawler user-agents as of mid-2026. The rule that organizes all of it: classify each bot as training (block to opt out — no traffic back), retrieval/search (allow — it cites you), or user-fetch (never block — it’s a visitor). Below: every major bot, its exact user-agent token, whether it respects robots.txt, its IP-range file for verification, and a recommendation. Treat the tokens as a dated layer — they drift; re-check quarterly.

What you’ll learn

  • The exact user-agent token for every major AI crawler
  • Which category each falls into (training / retrieval / user-fetch / opt-out token)
  • Whether each respects robots.txt and where to verify it by IP range
  • A clear allow/block recommendation per bot
  • Which “bots” are actually opt-out tokens, not crawlers

How to read this directory

Three categories drive every decision (full taxonomy: glossary/ai-crawler):

  • Training — turns your content into model weights; no citation, no traffic. Block to opt out.
  • Retrieval / search — indexes you to cite in AI answers, with a link. Allow.
  • User-fetch — fires when a real person asks an AI to open your page. Never block — it’s a visitor.

Two cautions before the table: user-agent strings are spoofable, so verify anything you care about against the operator’s published IP-range file; and robots.txt only governs compliant bots — non-compliant scrapers need a firewall (robots-txt-vs-waf-ai-bots).
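The IP-range check mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not any vendor's official method: it assumes the range file parses to a JSON object with a `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix` keys, which you should confirm against each operator's actual file. The function names and the sample ranges (reserved documentation addresses) are illustrative.

```python
import ipaddress


def load_prefixes(ranges_doc):
    """Collect CIDR networks from an already-parsed IP-range document.

    Assumed layout (verify per vendor):
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    networks = []
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(ip, networks):
    """Return True if the client IP falls inside any published network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Illustrative data only: these are reserved documentation ranges,
# not real crawler ranges.
sample_doc = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}
networks = load_prefixes(sample_doc)
print(ip_in_ranges("192.0.2.17", networks))
print(ip_in_ranges("198.51.100.5", networks))
```

In practice you would download the operator's file on a schedule, cache the parsed networks, and treat a request as verified only when both the user-agent token and the source IP match.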

Training crawlers (block to opt out of training)

| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Yes | openai.com/gptbot.json | GPTBot/1.3. See openai-crawlers |
| ClaudeBot | Anthropic | Yes | claude.com/crawling/bots.json | Anthropic now publishes ranges (2026). See anthropic-crawlers |
| CCBot | Common Crawl | Yes | | Open dataset feeding many models. See glossary/ccbot |
| Meta-ExternalAgent | Meta | Yes | | meta-externalagent/1.1; Meta AI / Llama. See meta-amazon-ai-crawlers |
| Amazonbot | Amazon | Yes | | May feed AI; mind the Amzn-SearchBot sibling |
| Google-Extended | Google | Yes (token) | | Opt-out token, not a crawler; see below. See google-ai-crawlers |
| Applebot-Extended | Apple | Yes (token) | | Opt-out token read by Applebot for Apple Intelligence training |

Retrieval / search crawlers (allow — they cite you)

| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
| --- | --- | --- | --- | --- |
| OAI-SearchBot | OpenAI | Yes | openai.com/searchbot.json | OAI-SearchBot/1.3. Allow to stay in ChatGPT |
| Claude-SearchBot | Anthropic | Yes | claude.com/crawling/bots.json | Allow for Claude citations |
| PerplexityBot | Perplexity | Yes | perplexity.com/perplexitybot.json | See the Aug-2025 caveat: perplexity-crawlers |
| Google-CloudVertexBot | Google | Yes | | Real crawler (unlike Google-Extended); Cloud/Vertex |

User-fetch agents (never block — they’re visitors)

| User-agent | Vendor | Respects robots.txt | IP ranges | Notes |
| --- | --- | --- | --- | --- |
| ChatGPT-User | OpenAI | Yes | openai.com/chatgpt-user.json | Fires on a real user’s request |
| Claude-User | Anthropic | Yes | claude.com/crawling/bots.json | Fires on a real user’s request |
| Perplexity-User | Perplexity | No (by design) | perplexity.com/perplexity-user.json | User-triggered; ignores robots.txt |
| Meta-ExternalFetcher | Meta | Yes | | Real-time when a user asks Meta AI |
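
For log analysis, the three tables above can be folded into a small lookup. The sketch below is one possible way to do it, assuming a case-insensitive substring match on the token is good enough for a first pass; the `CATEGORIES` mapping and `classify` function are illustrative names, and the sample user-agent strings are made up for the demo. Opt-out tokens such as Google-Extended are left out because they never appear in a request.

```python
# Tokens copied from the directory tables; re-check them quarterly.
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent", "Amazonbot"],
    "retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                  "Google-CloudVertexBot"],
    "user-fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User",
                   "Meta-ExternalFetcher"],
}


def classify(user_agent):
    """Return (category, token) for a known AI agent, or None otherwise."""
    ua = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua:
                return category, token
    return None


# Illustrative user-agent strings, not copied from vendor documentation.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0",
]
for sample in samples:
    print(classify(sample))
```

A match here only says what the request claims to be. Because the string is spoofable, pair the result with an IP-range check before acting on it.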

The “bots” that aren’t crawlers

Two of the most-discussed entries make no requests at all:

  • Google-Extended: a robots.txt opt-out token. It tells Google not to use already-crawled content for Gemini/Vertex training. Disallowing it does not stop any crawling and does not affect Google Search ranking. (google-ai-crawlers)
  • Applebot-Extended: the same idea for Apple Intelligence training; it is a signal read by the existing Applebot.

Disallowing these changes data use, not access. Conflating opt-out tokens with crawlers is the most common AI-access mistake.
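
A quick way to see why an opt-out token leaves crawling untouched is to run a rule through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` as a stand-in; it is not Google's parser, and the URL is a placeholder, but it shows the same group-matching idea: a rule addressed to Google-Extended does not apply to Googlebot.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that only opts out of Google's AI training use.
rules = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://example.com/pricing"  # placeholder URL
googlebot_allowed = parser.can_fetch("Googlebot", url)
extended_allowed = parser.can_fetch("Google-Extended", url)

# Googlebot has no matching group, so it stays allowed;
# only the Google-Extended token is disallowed.
print(googlebot_allowed, extended_allowed)  # True False
```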

Other bots worth knowing

  • OAI-AdsBot (OpenAI) — validates ad landing pages. New in 2026.
  • Amzn-SearchBot (Amazon) — a search-only sibling of Amazonbot; verify which you’re seeing. (meta-amazon-ai-crawlers)
  • Bytespider (ByteDance) — the canonical non-compliant scraper; widely reported to ignore robots.txt. Firewall-only enforcement. (glossary/bytespider)
  • Agentic-commerce/shopping bots — no stable, widely-documented robots.txt token as of mid-2026; they run via browser-agent fetchers. Watch this space.
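
For bots that ignore robots.txt, enforcement has to happen on the request itself. Real deployments usually do this in a WAF or CDN rule; the sketch below is only an application-layer stand-in written as WSGI middleware, with a hypothetical `DENYLIST` and class name. It matches on the user-agent string, so it stops only scrapers that identify themselves honestly.

```python
# Hypothetical denylist; extend it from your own logs.
DENYLIST = ("bytespider",)


class BlockUserAgents:
    """WSGI middleware that answers 403 for denylisted user-agent tokens."""

    def __init__(self, app, denylist=DENYLIST):
        self.app = app
        self.denylist = tuple(token.lower() for token in denylist)

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in self.denylist):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Tiny demo application used only to show the wrapping."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = BlockUserAgents(hello_app)
```

Wrapping the app this way keeps the denylist in one place. Moving the same rule to the edge (firewall, CDN) is generally preferable because blocked requests then never reach your servers.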

The recommendations above, expressed as a starter robots.txt:

# Block training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow retrieval/search (these cite you)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-fetch agents need no entry: any agent not matched by a Disallow rule is allowed by default. For the reasoning behind this policy, see which-ai-bots-to-block; for enforcing it against bots that ignore robots.txt, see robots-txt-vs-waf-ai-bots.
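
Because the token list drifts, it can help to generate the file from a small policy table and sanity-check the result before deploying. The sketch below is one way to do that, assuming Python's standard `urllib.robotparser` is an acceptable approximation of how compliant bots read the file; `POLICY`, `render_robots`, and the example URL are illustrative names, not part of any vendor tooling.

```python
from urllib.robotparser import RobotFileParser

# Section comment -> (directive, tokens). Tokens come from this directory.
POLICY = {
    "Block training": (
        "Disallow",
        ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Meta-ExternalAgent"],
    ),
    "Allow retrieval/search (these cite you)": (
        "Allow",
        ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    ),
}


def render_robots(policy):
    """Render one User-agent group per token, separated by blank lines."""
    lines = []
    for comment, (directive, tokens) in policy.items():
        lines.append(f"# {comment}")
        for token in tokens:
            lines.append(f"User-agent: {token}")
            lines.append(f"{directive}: /")
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


robots_txt = render_robots(POLICY)

# Round-trip the generated file through a parser to confirm the intent.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/guide"  # placeholder URL
checks = {
    agent: parser.can_fetch(agent, url)
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User")
}
print(checks)
```

ChatGPT-User has no group in the generated file, so the parser reports it as allowed, which matches the default-allow behavior described above.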

Common questions

  • Q: Will blocking GPTBot hurt my ChatGPT visibility? A: No. GPTBot only collects training data; ChatGPT search citations depend on OAI-SearchBot, a separate bot you can allow. See gptbot-vs-oai-searchbot.
  • Q: Does blocking Google-Extended hurt my Google ranking? A: No. It is an opt-out token, not Googlebot. See google-ai-crawlers.
  • Q: Is Cloudflare already blocking these for me? A: Possibly. Cloudflare has blocked AI crawlers by default on new domains since July 2025, so check your own settings. See does-cloudflare-block-ai-crawlers.

Maintenance note

User-agent tokens, version numbers, and IP-range publication change frequently — this directory is a mid-2026 snapshot. Re-verify against each vendor’s official docs before deploying, and refresh quarterly. A stale block list is a silent liability.

Key takeaways

  • Classify every bot as training (block), retrieval (allow), or user-fetch (never block).
  • Verify by published IP range, not the user-agent string (it’s spoofable).
  • Google-Extended and Applebot-Extended are opt-out tokens, not crawlers.
  • robots.txt governs compliant bots only; non-compliant scrapers (Bytespider) need a firewall.
  • Treat the tokens as a dated layer — re-check quarterly.
