How AI Crawlers Work: From Request to Model, Answer, or Visit
How AI crawlers fetch and use your content: the request, user-agent identification, robots.txt check, and the three destinations — model training, a citation index, or a live user's screen.
By Andrej Ruckij · June 17, 2026
TL;DR: An AI crawler makes an HTTP request to your page, identifies itself with a user-agent string, (if compliant) checks your robots.txt, and fetches the content. What happens next depends on the bot’s job: a training crawler feeds the content into model weights, a retrieval crawler indexes it to cite later, and a user-fetch agent hands it straight to a person who asked. Same fetch, three very different destinations.
This article is part of the cluster under the complete guide to AI crawler access control. Understanding the mechanics makes every access decision far easier to reason about.
The basic request flow
Mechanically, an AI crawler behaves like any web client:
- It requests a URL — a standard HTTP GET to your page.
- It identifies itself with a user-agent string (e.g. GPTBot/1.3). Reputable bots use a documented token; see the crawler directory.
- A compliant bot checks robots.txt before fetching, and honors Disallow rules. A non-compliant one skips this entirely.
- It fetches and parses the page. Crawlers are increasingly able to render JavaScript, though server-rendered content is still more reliably read.
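The steps above can be sketched with Python's standard library. This is a minimal illustration, not how any particular crawler is implemented: ExampleBot/1.0 is a hypothetical token, the robots.txt content is made up and inlined so the example runs offline, and build_request_if_allowed is an illustrative name.

```python
from urllib import request, robotparser

USER_AGENT = "ExampleBot/1.0"  # hypothetical token, not a real crawler

# Hypothetical robots.txt, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def build_request_if_allowed(url, robots_txt=ROBOTS_TXT, user_agent=USER_AGENT):
    """Return a GET request carrying the bot's user-agent,
    or None when robots.txt disallows the URL for that agent."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return None  # a compliant bot stops here; a non-compliant one never checks
    # The request is only built, not sent.
    return request.Request(url, headers={"User-Agent": user_agent}, method="GET")


allowed = build_request_if_allowed("https://example.com/blog/post")
blocked = build_request_if_allowed("https://example.com/private/report")
print(allowed.get_method(), allowed.full_url)
print(blocked)
```

A real crawler would load the live file with RobotFileParser.set_url() and read(), then send the request with urllib.request.urlopen() and parse the response body.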
Nothing here is exotic. The interesting part is what the content is for.
The three destinations
The same fetched page goes to one of three places, and that’s the whole taxonomy (glossary/ai-crawler):
- Training — the content becomes part of a model’s training data, baked into weights. No link back, no traffic. (e.g. GPTBot, ClaudeBot, CCBot.)
- Retrieval / search — the content is indexed so the AI can fetch and cite it when answering a relevant question, with a link. (e.g. OAI-SearchBot, PerplexityBot.)
- User-fetch — the content is handed directly to a person who asked the AI to open that specific page, in real time. (e.g. ChatGPT-User, Perplexity-User.)
This is why “block all AI” is a blunt instrument: it treats three very different transactions as one.
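A more selective policy can be sketched as a robots.txt that treats the three types differently. The example below is an assumption for illustration only: it blocks the training tokens named above, allows the retrieval and user-fetch tokens, and uses Python's urllib.robotparser to show how a compliant parser would read it. Each vendor's own matching logic may differ from Python's, and the file only affects bots that choose to honor it.

```python
from urllib import robotparser

# Tokens are the examples from the list above; the policy itself is hypothetical.
SELECTIVE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Perplexity-User
Allow: /
"""

AGENTS = [
    "GPTBot", "ClaudeBot", "CCBot",        # training
    "OAI-SearchBot", "PerplexityBot",      # retrieval / search
    "ChatGPT-User", "Perplexity-User",     # user-fetch
]


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Map each user-agent token to whether robots.txt lets it fetch the URL."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return {agent: rules.can_fetch(agent, url) for agent in agents}


decisions = allowed_agents(SELECTIVE_ROBOTS_TXT, AGENTS)
for agent, ok in decisions.items():
    print(f"{agent}: {'allow' if ok else 'block'}")
```

Under this policy the training crawlers are turned away while citation and live-user fetches still get through, which a single blanket Disallow cannot express.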
How AI crawlers differ from search crawlers
Googlebot crawls to index for ranking — and historically traded crawling for referral traffic. AI training crawlers break that bargain: they fetch heavily and return little or nothing (see ai-crawler-traffic-impact for the asymmetry). AI retrieval crawlers are closer to the old search bargain — they cite and link. So an AI crawler can be more extractive than Googlebot (training) or roughly analogous to it (retrieval), depending on the job.
Identity is claimed, not proven
A crucial mechanical caveat: the user-agent string is self-reported and trivially faked. A scraper can call itself GPTBot while having nothing to do with OpenAI. That’s why verification uses the operator’s published IP ranges rather than the name, and why robots.txt — which keys off the claimed user-agent — only works on bots that choose to be honest. Enforcement against the dishonest ones happens at the firewall (robots-txt-vs-waf-ai-bots).
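The check against published ranges might look like the sketch below. The CIDR blocks are placeholders taken from the RFC 5737 documentation ranges, not any operator's real addresses, and PUBLISHED_RANGES and is_verified are hypothetical names. In practice the ranges would be loaded from the list the operator publishes and refreshed regularly.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real operator ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified(user_agent, remote_ip, published=PUBLISHED_RANGES):
    """Return True only if the claimed bot token matches a known operator
    AND the connecting IP falls inside that operator's published ranges."""
    addr = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in published.items():
        if token.lower() in ua:
            return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # unknown token: nothing to verify against


# Same claimed name, different source addresses.
print(is_verified("Mozilla/5.0 (compatible; GPTBot/1.3)", "192.0.2.10"))
print(is_verified("Mozilla/5.0 (compatible; GPTBot/1.3)", "203.0.113.5"))
```

The second request carries the same user-agent string but comes from outside the listed ranges, so it is treated as an impostor. This is the kind of decision a firewall rule makes and robots.txt cannot.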
Key takeaways
- An AI crawler requests a URL, identifies via user-agent, (if compliant) checks robots.txt, then fetches.
- The same fetch feeds one of three destinations: training, retrieval/search, or user-fetch.
- AI training crawlers break the search bargain (heavy fetch, little return); retrieval crawlers preserve it (cite + link).
- User-agent identity is claimed, not proven — verify by IP range; enforce at the firewall.
Related articles
- ai-crawler-access-control-guide — the parent guide
- glossary/ai-crawler — the three-type taxonomy
- ai-crawler-user-agents-directory — the user-agent tokens by vendor
- ai-crawler-traffic-impact — the cost side of all this fetching
- robots-txt-vs-waf-ai-bots — why claimed identity needs firewall-level verification
Sources
- seo/ai-crawler-access — internal synthesis
- OpenAI — Bots / Crawlers documentation