Pages tagged "robots-txt"

4 pages tagged with robots-txt.

AI Crawler Access Control: Bot Taxonomy, robots.txt vs WAF
How to decide which AI bots to allow or block: the training / retrieval / user-fetch taxonomy, why a WAF enforces access rules while robots.txt only requests compliance, and the current (2026) user-agent strings for the OpenAI, Anthropic, Google, Perplexity, and Meta crawlers.

Bytespider: Definition
Bytespider is the web crawler of ByteDance, TikTok's parent company. It is widely reported to ignore robots.txt and crawl aggressively, which makes it the canonical example of why robots.txt alone can't stop a non-compliant AI scraper.

CCBot (Common Crawl): Definition
CCBot is Common Crawl's web crawler. Common Crawl is a nonprofit that publishes a free, open archive of the web, and that archive is a major training-data source for many AI models. CCBot respects robots.txt.

GPTBot: Definition
GPTBot is OpenAI's web crawler that collects content to train its models. It respects robots.txt, publishes its IP ranges, and is distinct from OAI-SearchBot (search) and ChatGPT-User (user-fetch).
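
As a rough illustration of what these pages cover, the sketch below uses Python's standard `urllib.robotparser` to check which user agents a robots.txt policy permits. The policy text, the `allowed_agents` helper, and the `example.com` URL are hypothetical and chosen only for this example. The result describes what a compliant crawler would be asked to do; it says nothing about whether a given bot actually obeys.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy (an assumption for illustration): disallow the
# crawlers named in the summaries above, allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} stating whether robots.txt permits each agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "Bytespider", "OAI-SearchBot"],
    "https://example.com/post",  # placeholder URL
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Under this policy the check reports Bytespider as disallowed, but that is only a request. A crawler that ignores robots.txt has to be blocked at the network edge, for example by a WAF rule.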