How AI Crawlers Work: From Request to Model, Answer, or Visit

By Andrej Ruckij · June 17, 2026

TL;DR: An AI crawler makes an HTTP request to your page, identifies itself with a user-agent string, (if compliant) checks your robots.txt, and fetches the content. What happens next depends on the bot’s job: a training crawler feeds the content into model weights, a retrieval crawler indexes it to cite later, and a user-fetch agent hands it straight to a person who asked. Same fetch, three very different destinations.

This article is part of the complete guide to AI crawler access control. Understanding the mechanics makes each access decision much easier to reason about.

The basic request flow

Mechanically, an AI crawler behaves like any web client:

It requests a URL — a standard HTTP GET to your page.
It identifies itself with a user-agent string (e.g. GPTBot/1.3). Reputable bots use a documented token; see the crawler directory.
A compliant bot checks robots.txt before fetching and honors Disallow rules. A non-compliant one skips this entirely.
It fetches and parses the page. Some crawlers can render JavaScript, but server-rendered content is still read more reliably.
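The four steps above can be sketched from the crawler's side with Python's standard library. This is a minimal illustration, not how any particular vendor's bot is implemented: the robots.txt body, the example.com URLs, and the `plan_fetch` helper are all assumptions made up for the example, and no network request is actually sent.

```python
from typing import Optional
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would download this from
# the site's /robots.txt before requesting any page.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def plan_fetch(robots_txt: str, user_agent: str, url: str) -> Optional[Request]:
    """Return a prepared GET request if robots.txt allows it, else None.

    Hypothetical helper: it mirrors the compliant flow (check rules,
    then build an identified request) without sending anything.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A compliant bot stops here when a Disallow rule matches.
    if not parser.can_fetch(user_agent, url):
        return None

    # The bot identifies itself through the User-Agent header.
    return Request(url, headers={"User-Agent": user_agent})


blocked = plan_fetch(EXAMPLE_ROBOTS_TXT, "GPTBot/1.3", "https://example.com/private/report")
allowed = plan_fetch(EXAMPLE_ROBOTS_TXT, "GPTBot/1.3", "https://example.com/blog/post")
```

Here `blocked` is `None` because the path falls under the Disallow rule, while `allowed` is a ready-to-send GET request. A non-compliant bot is simply one that skips the `can_fetch` check.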

Nothing here is exotic. The interesting part is what the content is for.

The three destinations

The same fetched page goes to one of three places, and that’s the whole taxonomy (glossary/ai-crawler):

Training — the content becomes part of a model’s training data, baked into weights. No link back, no traffic. (e.g. GPTBot, ClaudeBot, CCBot.)
Retrieval / search — the content is indexed so the AI can fetch and cite it when answering a relevant question, with a link. (e.g. OAI-SearchBot, PerplexityBot.)
User-fetch — the content is handed directly to a person who asked the AI to open that specific page, in real time. (e.g. ChatGPT-User, Perplexity-User.)
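To make the taxonomy concrete, here is a small sketch that maps a claimed user-agent string to one of the three destinations. The token table contains only the examples named above, so it is illustrative rather than a complete directory, and the `Destination` enum and `classify` function are hypothetical names introduced for this example.

```python
from enum import Enum


class Destination(Enum):
    TRAINING = "training"
    RETRIEVAL = "retrieval"
    USER_FETCH = "user-fetch"
    UNKNOWN = "unknown"


# Only the example tokens from this article; not an exhaustive list.
TOKEN_DESTINATIONS = {
    "GPTBot": Destination.TRAINING,
    "ClaudeBot": Destination.TRAINING,
    "CCBot": Destination.TRAINING,
    "OAI-SearchBot": Destination.RETRIEVAL,
    "PerplexityBot": Destination.RETRIEVAL,
    "ChatGPT-User": Destination.USER_FETCH,
    "Perplexity-User": Destination.USER_FETCH,
}


def classify(user_agent: str) -> Destination:
    """Classify a claimed user-agent by case-insensitive token match."""
    lowered = user_agent.lower()
    for token, destination in TOKEN_DESTINATIONS.items():
        if token.lower() in lowered:
            return destination
    return Destination.UNKNOWN
```

For instance, `classify("Mozilla/5.0 (compatible; GPTBot/1.3)")` returns `Destination.TRAINING`, while an ordinary browser string returns `Destination.UNKNOWN`. The result reflects only what the request claims to be, not who actually sent it.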

This is why “block all AI” is a blunt instrument: it treats three very different transactions as one.
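A more selective policy might look like the sketch below, which blocks the training tokens while leaving retrieval and user-fetch tokens allowed. The policy itself is an assumption chosen for illustration, not a recommendation, and `policy_report` is a hypothetical helper. Python's standard `urllib.robotparser` is used here only to show how a compliant bot would read such a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse training crawlers, allow the other two types.
SELECTIVE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
User-agent: Perplexity-User
Allow: /
"""


def policy_report(robots_txt: str, tokens: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


report = policy_report(
    SELECTIVE_ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "https://example.com/article",
)
```

With this file, `report` marks GPTBot as disallowed and the other two tokens as allowed, so each of the three transactions gets its own decision.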

How AI crawlers differ from search crawlers

Googlebot crawls to index for ranking — and historically traded crawling for referral traffic. AI training crawlers break that bargain: they fetch heavily and return little or nothing (see ai-crawler-traffic-impact for the asymmetry). AI retrieval crawlers are closer to the old search bargain — they cite and link. So an AI crawler can be more extractive than Googlebot (training) or roughly analogous to it (retrieval), depending on the job.

Identity is claimed, not proven

A crucial mechanical caveat: the user-agent string is self-reported and trivially faked. A scraper can call itself GPTBot while having nothing to do with OpenAI. That’s why verification uses the operator’s published IP ranges rather than the name, and why robots.txt — which keys off the claimed user-agent — only works on bots that choose to be honest. Enforcement against the dishonest ones happens at the firewall (robots-txt-vs-waf-ai-bots).
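A rough sketch of IP-range verification follows. The CIDR blocks below are placeholders taken from the address ranges reserved for documentation, NOT any operator's real ranges. In practice the list would be loaded from the operator's own published source. `PUBLISHED_RANGES` and `is_verified` are hypothetical names.

```python
import ipaddress

# PLACEHOLDER data: reserved documentation ranges, not real crawler IPs.
# Replace with the ranges the operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified(claimed_token: str, remote_ip: str) -> bool:
    """Return True only if remote_ip falls inside a range published
    for the token the request claims to be."""
    address = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(claimed_token, [])
    return any(address in ipaddress.ip_network(cidr) for cidr in networks)


genuine = is_verified("GPTBot", "192.0.2.44")     # inside a listed range
spoofed = is_verified("GPTBot", "203.0.113.9")    # claims GPTBot, wrong IP
```

A request that claims a known token but fails this check is a candidate for blocking at the firewall, since a robots.txt rule would never be consulted by a bot that lies about its name.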

Key takeaways

An AI crawler requests a URL, identifies itself via its user-agent string, (if compliant) checks robots.txt, then fetches.
The same fetch feeds one of three destinations: training, retrieval/search, or user-fetch.
AI training crawlers break the search bargain (heavy fetch, little return); retrieval crawlers preserve it (cite + link).
User-agent identity is claimed, not proven — verify by IP range; enforce at the firewall.

ai-crawler-access-control-guide — the parent guide
glossary/ai-crawler — the three-type taxonomy
ai-crawler-user-agents-directory — the user-agent tokens by vendor
ai-crawler-traffic-impact — the cost side of all this fetching
robots-txt-vs-waf-ai-bots — why claimed identity needs firewall-level verification

Sources

seo/ai-crawler-access — internal synthesis
OpenAI — Bots / Crawlers documentation