Pages tagged "ai-crawlers"
8 pages tagged with ai-crawlers.
- AI Crawler — Definition An AI crawler is an automated bot that fetches web content for an AI system: to train a model, build a citation index for AI search, or fetch a page a user asked about. Which of these three purposes a bot serves determines the access policy you apply to it.
- AI Crawler Access Control — Bot Taxonomy, robots.txt vs WAF How to decide which AI bots to allow or block: the training / retrieval / user-fetch taxonomy, why a WAF enforces where robots.txt only requests, and the current (2026) user-agent strings for OpenAI, Anthropic, Google, Perplexity, and Meta crawlers.
- Bytespider — Definition Bytespider is the web crawler of ByteDance, TikTok's parent company. It is widely reported to ignore robots.txt and crawl aggressively, which makes it the canonical example of why robots.txt alone can't stop a non-compliant AI scraper.
- CCBot (Common Crawl) — Definition CCBot is Common Crawl's web crawler. Common Crawl is a nonprofit that publishes a free, open archive of the web — and that archive is a major training-data source for many AI models. CCBot respects robots.txt.
- GPTBot — Definition GPTBot is OpenAI's web crawler that collects content to train its models. It respects robots.txt, publishes its IP ranges, and is distinct from OAI-SearchBot (search) and ChatGPT-User (user-fetch).
- llms.txt — Definition llms.txt is a proposed plain-text/markdown file that gives AI systems a curated map of your site's most important content. It's advisory — it helps comprehension, not access control.
- Pay-Per-Crawl — Definition Pay-per-crawl is Cloudflare's model that lets sites charge AI crawlers for access using the HTTP 402 'Payment Required' status code and crawler-price headers — turning bot access into a transaction instead of a free-for-all.
- WAF (Web Application Firewall) — Definition A WAF is a firewall that inspects and blocks web requests at the edge before they reach your server. For AI bots it's the enforcement layer robots.txt isn't — it acts, robots.txt only asks.
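
The contrast these entries draw between robots.txt (which asks) and a WAF (which acts) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's implementation: the robots.txt policy, the `BLOCKED_TOKENS` list, the helper names `robots_allows` and `edge_status`, and the sample User-Agent strings are all hypothetical. Only `urllib.robotparser` is a real standard-library module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: ask three crawlers to stay out, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative tokens an edge rule might match on (not an exhaustive list).
BLOCKED_TOKENS = ("gptbot", "ccbot", "bytespider")


def robots_allows(agent_token: str, url: str) -> bool:
    """What a *compliant* crawler would conclude from robots.txt.

    Nothing here stops a crawler that never asks this question.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent_token, url)


def edge_status(user_agent_header: str) -> int:
    """WAF-style check: decide on the request itself, before the origin sees it.

    Returns the HTTP status the edge would send (403 for blocked bots).
    """
    ua = user_agent_header.lower()
    return 403 if any(token in ua for token in BLOCKED_TOKENS) else 200


if __name__ == "__main__":
    url = "https://example.com/article"
    # A compliant crawler checks first and backs off.
    print("GPTBot allowed by robots.txt:", robots_allows("GPTBot", url))
    # A non-compliant crawler skips the check; only the edge rule stops it.
    print("Edge status for Bytespider:", edge_status("Mozilla/5.0 (compatible; Bytespider)"))
```

Matching on the User-Agent header alone is the simplest possible rule; a header can be spoofed, which is one reason published crawler IP ranges (as noted for GPTBot above) are useful for verification.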