#robots-txt
22 posts tagged with robots-txt.
-
AI Crawler Access Control: The Complete Guide (2026)
Everything site owners need on AI crawler access: the training/retrieval/user-fetch taxonomy, the block-or-allow decision, how to set up robots.txt, why a WAF enforces where robots.txt asks, llms.txt, costs, and regulation.
-
AI Crawler Access Checklist: 8 Steps for Site Owners
A practical checklist to get AI crawler access right: audit current access, set a robots.txt policy, reconcile your CDN/WAF, verify by IP range, decide on llms.txt, monitor, and refresh quarterly.
-
AI Crawler Regulation in the EU and UK (2026): What Site Owners Should Know
The EU AI Act makes machine-readable opt-outs (like robots.txt) legally meaningful for AI training; the UK dropped its text-and-data-mining opt-out plan in March 2026 and is waiting. What that means for your robots.txt.
-
Do AI Crawlers Cost You Money? Bandwidth, Server Load, and the Broken Bargain
AI crawlers can consume real bandwidth and server resources — and training crawlers especially give little back. Here's the cost side of AI crawling, how to measure it, and when it justifies blocking.
-
The AI Crawler Directory: Every User-Agent, What It Does, Allow or Block (2026)
A complete reference table of AI crawler user-agents — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider and more — with each bot's job, robots.txt compliance, IP-range file, and an allow/block recommendation.
-
Anthropic's Crawlers: ClaudeBot, Claude-SearchBot, Claude-User
Anthropic runs ClaudeBot (training), Claude-SearchBot (search), and Claude-User (user-fetch). All respect robots.txt, and — updated in 2026 — Anthropic now publishes IP ranges for all three.
-
Should You Block or Allow AI Crawlers? The 2026 Decision Framework
A decision framework for AI crawler access: block training (it takes, gives nothing), allow search (it cites you), never block user-fetch (it's a visitor). The right answer depends on whether your content is your product or your marketing.
-
Google's AI Crawlers: Google-Extended, Google-CloudVertexBot, and Gemini
Google's AI crawling is confusing because Google-Extended isn't a crawler — it's an opt-out token. Here's how Google-Extended, Google-CloudVertexBot, and Googlebot relate to Gemini training and AI features.
-
How AI Crawlers Work: From Request to Model, Answer, or Visit
How AI crawlers fetch and use your content: the request, user-agent identification, robots.txt check, and the three destinations — model training, a citation index, or a live user's screen.
-
How to Block AI Scrapers: The Complete Enforcement Guide (2026)
robots.txt won't stop scrapers that ignore it. This is the enforcement layer: WAF rules, bot verification by IP range, IP/ASN blocking, rate limiting, and tarpits — how to actually keep non-compliant AI crawlers out.
-
Meta and Amazon AI Crawlers: Meta-ExternalAgent, Meta-ExternalFetcher, Amazonbot
Meta runs Meta-ExternalAgent (training) and Meta-ExternalFetcher (user-fetch); Amazon runs Amazonbot, with a search-only sibling Amzn-SearchBot that's easy to confuse. How to handle each.
-
OpenAI's Crawlers: GPTBot, OAI-SearchBot, ChatGPT-User (and OAI-AdsBot)
OpenAI runs separate bots for separate jobs: GPTBot (training), OAI-SearchBot (search), ChatGPT-User (user-fetch), and OAI-AdsBot. Here's what each does, whether to block it, and how to verify it by IP range.
-
Perplexity's Crawlers: PerplexityBot, Perplexity-User, and the Stealth-Crawling Controversy
PerplexityBot indexes for citations; Perplexity-User fetches pages users ask about (and ignores robots.txt by design). Plus the August 2025 Cloudflare report that Perplexity crawled sites that blocked it.
-
Publishers Are Blocking AI Crawlers: Who, Why, and What It Means for You
Around 80% of top news sites now block AI training bots, using blocking as leverage for licensing deals. Why publishers block — and why the publisher playbook usually doesn't fit a business that needs AI visibility.
-
What You Lose by Blocking AI Search Bots
Blocking AI search/retrieval bots (OAI-SearchBot, PerplexityBot) removes you from AI answers entirely — no citation, no referral traffic, no presence when buyers ask AI. Here's the real cost of over-blocking.
-
Can robots.txt Stop AI Scrapers?
No. robots.txt only asks compliant bots to stay away — non-compliant AI scrapers ignore it. To actually stop them you need a WAF, IP/ASN blocking, and bot verification at the edge.
-
Do AI Crawlers Respect robots.txt?
Some do, many don't. Reputable AI crawlers like GPTBot, ClaudeBot, and PerplexityBot honor robots.txt; non-compliant scrapers ignore it. robots.txt is a request, not enforcement.
-
llms.txt vs robots.txt: What's the Difference?
robots.txt controls crawler access (what bots may fetch); llms.txt offers AI a curated content map (comprehension). One is about permission, the other about understanding — and neither actually enforces anything.
-
Why robots.txt Won't Block AI Bots (and What Actually Does)
robots.txt only asks AI crawlers to stay away; a WAF enforces. Here's why a firewall rule beats robots.txt, why non-compliant scrapers ignore the file, and the layered setup that actually controls AI bot access.
-
Should I Allow AI Crawlers?
Allow AI search and user-fetch crawlers: they cite you in AI answers and bring real visitors. Consider blocking only training crawlers, which take content for model training and give nothing back.
-
Should I Block GPTBot?
Block GPTBot if you don't want your content training OpenAI's models for free — it gives no traffic back. But blocking GPTBot doesn't affect ChatGPT search visibility; that's a separate bot you can allow.
-
Which AI Bots Should You Block? (And Why robots.txt Won't Stop Them)
A plain-English guide to AI crawler access: the training vs. retrieval vs. user-fetch bot taxonomy, which to allow or block, and why a firewall enforces where robots.txt only asks.
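
As a rough illustration of the training vs. search vs. user-fetch split that these posts describe, the sketch below uses Python's standard-library `urllib.robotparser` to check what a sample robots.txt would ask of each bot. The policy itself (disallow the training bots GPTBot and ClaudeBot, allow OAI-SearchBot and ChatGPT-User), the helper name `access_report`, and the URL `https://example.com/pricing` are assumptions made for the example, not a recommendation taken verbatim from any one post. It only shows what a compliant crawler would be asked to do; it enforces nothing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: disallow training crawlers, allow search and user-fetch.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def access_report(robots_text, agents, url):
    """Return {agent: bool} saying whether robots_text permits each agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User"],
    "https://example.com/pricing",  # placeholder URL
)

for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

A check like this is useful for confirming that a policy says what you intended before deploying it. Bots that ignore robots.txt are unaffected, which is why the posts above pair it with WAF rules and IP-range verification.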