Pages tagged "robots-txt"
4 pages tagged with robots-txt.
- AI Crawler Access Control — Bot Taxonomy, robots.txt vs WAF: How to decide which AI bots to allow or block: the training / retrieval / user-fetch taxonomy, why a WAF enforces where robots.txt only requests, and the current (2026) user-agent strings for OpenAI, Anthropic, Google, Perplexity, and Meta crawlers.
- Bytespider — Definition: Bytespider is the web crawler of ByteDance (TikTok's parent company), widely reported to ignore robots.txt and crawl aggressively. It's the canonical example of why robots.txt alone can't stop a non-compliant AI scraper.
- CCBot (Common Crawl) — Definition: CCBot is Common Crawl's web crawler. Common Crawl is a nonprofit that publishes a free, open archive of the web, and that archive is a major training-data source for many AI models. CCBot respects robots.txt.
- GPTBot — Definition: GPTBot is OpenAI's web crawler that collects content to train its models. It respects robots.txt, publishes its IP ranges, and is distinct from OAI-SearchBot (search) and ChatGPT-User (user-fetch).
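
As a rough illustration of what "respects robots.txt" means for the crawlers listed above, the sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt the way a compliant crawler would. The robots.txt content, the `is_allowed` helper, the `ExampleBot` agent, and the `example.com` URL are all assumptions made for illustration. The check only models a crawler that chooses to honor the file; it does nothing against one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask two training crawlers to stay out,
# and allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up agent that falls through to the "*" group.
    for agent in ("GPTBot", "CCBot", "ExampleBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/page")
        print(f"{agent}: {'allowed' if verdict else 'disallowed'}")
```

Because the file is only a request, blocking a crawler that does not honor it would need enforcement elsewhere, such as the WAF approach described in the first entry.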