How to Verify a Real AI Bot (IP Ranges, Reverse DNS)

By Andrej Ruckij · June 17, 2026

TL;DR: A bot’s user-agent string is self-reported and trivially faked, so never trust the name alone. Verify against the operator’s published IP-range file (OpenAI, Anthropic, and Perplexity each publish one) or via reverse DNS. A request claiming to be GPTBot from outside OpenAI’s published ranges is not GPTBot — and should be treated as a scraper.

This article is part of the enforcement guide cluster. Verification is the foundation of every other enforcement step: block and allow rules are only as trustworthy as your ability to confirm who’s actually knocking.

Why the user-agent isn’t enough

The user-agent string is just an HTTP header the client sets itself. Any scraper can send User-Agent: GPTBot while having nothing to do with OpenAI, and stealth crawlers exploit the same weakness to slip past name-based rules (the Perplexity stealth-crawl case is the textbook example). So a robots.txt group that pairs User-agent: GPTBot with Disallow: /, or a firewall rule keyed on the name, catches honest bots and misses dishonest ones. Verification closes that gap.
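To make the point concrete, here is a minimal sketch of how little effort spoofing takes. It only builds a request object with Python’s standard library and sends nothing; example.com is a placeholder target, and the bare GPTBot token stands in for the longer string a real crawler sends.

```python
import urllib.request

# The User-Agent header is just a string the sender chooses.
# example.com is a placeholder; no request is actually sent here.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "GPTBot"},
)

# urllib stores header names in capitalized form ("User-agent").
claimed = req.get_header("User-agent")
print(claimed)
```

Nothing in the request ties that string to OpenAI, which is why the checks that follow look at the source IP instead.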

Method 1: published IP-range files (best)

The major AI operators publish machine-readable lists of the IP ranges their bots crawl from. Check the requesting IP against the relevant file:

Operator	IP-range file
OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User)	`openai.com/gptbot.json`, `openai.com/searchbot.json`, `openai.com/chatgpt-user.json`
Anthropic (ClaudeBot, Claude-SearchBot, Claude-User)	`claude.com/crawling/bots.json`
Perplexity (PerplexityBot, Perplexity-User)	`perplexity.com/perplexitybot.json`, `perplexity.com/perplexity-user.json`

If a request claims a bot’s user-agent but its source IP isn’t in that operator’s published range, it’s not the real bot. Full token reference: ai-crawler-user-agents-directory.
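The range check can be sketched with Python’s ipaddress module. This assumes the file has already been downloaded and parsed, and that it uses a "prefixes" list with "ipv4Prefix" / "ipv6Prefix" keys; that layout is an assumption, so confirm it against the actual file before relying on it. The sample ranges are reserved documentation addresses, not real crawler ranges.

```python
import ipaddress

def ip_in_published_ranges(ip: str, ranges_doc: dict) -> bool:
    """Return True if `ip` falls inside any prefix listed in `ranges_doc`.

    Assumed layout (verify against the operator's real file):
    {"prefixes": [{"ipv4Prefix": "a.b.c.d/n"}, {"ipv6Prefix": "x::/n"}]}
    """
    addr = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr is None:
            continue  # skip entries that carry no prefix
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False

# Placeholder data using documentation-only ranges (RFC 5737 / RFC 3849).
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

print(ip_in_published_ranges("192.0.2.77", SAMPLE_RANGES))   # inside the sample range
print(ip_in_published_ranges("203.0.113.5", SAMPLE_RANGES))  # outside it
```

In practice the file would be fetched on a schedule and cached, since operators update their ranges over time.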

Method 2: reverse DNS (where ranges aren’t published)

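For operators that don’t publish IP ranges (e.g. Common Crawl’s CCBot, some others), use a forward-confirmed reverse DNS check: do a reverse DNS lookup on the requesting IP, confirm the returned hostname belongs to the operator’s expected domain, then forward-resolve that hostname and confirm the result includes the original IP. The forward step matters because whoever controls an IP block can point its reverse record at any name they like, but only the domain owner can make that name resolve back to the IP. This is the same technique used to verify Googlebot. It’s more work than an IP-range lookup but defeats simple spoofing.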
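A minimal sketch of the forward-confirmed check, using only the standard socket and ipaddress modules. The function name, the allowed_suffixes parameter, and the .bot.example domain are hypothetical; the real expected domain has to come from the operator’s own documentation. The resolver functions are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

def _reverse_lookup(ip: str) -> str:
    # PTR lookup: returns the primary hostname for the IP.
    return socket.gethostbyaddr(ip)[0]

def _forward_lookup(hostname: str) -> set:
    # A/AAAA lookup: returns every address the hostname resolves to.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=_reverse_lookup, forward=_forward_lookup):
    """Return True only if IP -> hostname -> IP round-trips within an allowed domain."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Suffixes carry a leading dot so "evilbot.example" cannot match ".bot.example".
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved

# Usage with stub resolvers (hypothetical hostnames, documentation IPs).
stub_reverse = lambda ip: "crawler-1.bot.example"
stub_forward = lambda host: {"192.0.2.10"}
print(forward_confirmed_rdns("192.0.2.10", (".bot.example",), stub_reverse, stub_forward))
```

Because each check costs two DNS lookups, results are usually cached per IP rather than recomputed on every request.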

Method 3: behavioral signals (the backstop)

When identity can’t be confirmed, behavior gives it away: request velocity far above a human’s, sequential crawling of deep URLs, ignoring robots.txt, and rotating user-agents or IPs within a session. These don’t prove who a bot is, but they flag that something automated and uncooperative is happening, which is enough to rate-limit or challenge it.
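As one illustration, request velocity can be tracked with a sliding window per client IP. This is a sketch, not a tuned detector: the VelocityMonitor class is hypothetical, and the thresholds are arbitrary placeholders that would need calibrating against real traffic.

```python
from collections import defaultdict, deque

class VelocityMonitor:
    """Flag clients that exceed `max_requests` within `window_seconds`."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 10.0):
        # Default thresholds are illustrative assumptions, not recommendations.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request timestamps

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the threshold."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests

monitor = VelocityMonitor(max_requests=3, window_seconds=1.0)
flags = [monitor.record("198.51.100.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(flags)
```

A flag from a monitor like this says nothing about identity; it is a trigger for rate limiting or a challenge, in line with the backstop role described above.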

Putting it together

For bots you allow (search/user-fetch): verify by IP range so a scraper can’t impersonate them to bypass other rules.
For bots you block: name-based robots.txt handles the honest ones; IP/behavioral enforcement handles the rest (block-ai-crawler-ip-asn).
Managed shortcut: CDNs like Cloudflare maintain verified-bot lists and do this verification for you (does-cloudflare-block-ai-crawlers) — though they’ve also de-listed operators caught spoofing.
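The policy above can be sketched as a small decision function. The function name, the action labels, and the choice to block (rather than challenge) an unverified claim are all illustrative assumptions; ip_verified and suspicious stand for the results of the range/DNS check and the behavioral check respectively.

```python
def decide(claimed_bot, ip_verified, suspicious, allowed_bots, blocked_bots):
    """Return "allow", "block", or "challenge" for one request.

    claimed_bot:  bot token taken from the user-agent, or None if none matched
    ip_verified:  True if the source IP passed range or reverse DNS verification
    suspicious:   True if behavioral signals flagged the client
    """
    if claimed_bot in allowed_bots:
        # An allowed name only counts when the IP backs it up.
        return "allow" if ip_verified else "block"
    if claimed_bot in blocked_bots:
        # Blocked bots are refused whether or not the claim is genuine.
        return "block"
    # No recognized bot name: fall back to behavior.
    return "challenge" if suspicious else "allow"

ALLOWED = {"OAI-SearchBot", "ChatGPT-User"}
BLOCKED = {"GPTBot"}

print(decide("OAI-SearchBot", True, False, ALLOWED, BLOCKED))
print(decide("OAI-SearchBot", False, False, ALLOWED, BLOCKED))
print(decide(None, False, True, ALLOWED, BLOCKED))
```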

Key takeaways

Never trust the user-agent name — it’s spoofable.
Verify by published IP-range file first (OpenAI, Anthropic, Perplexity publish them).
Use forward-confirmed reverse DNS where ranges aren’t published.
Behavioral signals are the backstop for unverifiable bots; CDNs can do verification for you.

Related guides

how-to-block-ai-scrapers — the parent enforcement guide
ai-crawler-user-agents-directory — the IP-range files per vendor
block-ai-crawler-ip-asn — acting on verification with IP/ASN rules
robots-txt-vs-waf-ai-bots — why name-based rules aren’t enforcement
perplexity-crawlers — the stealth/spoofing case study

Sources

OpenAI — Bots / Crawlers documentation
Anthropic — crawler documentation
seo/ai-crawler-access — internal synthesis