How to Block AI Scrapers: The Complete Enforcement Guide (2026)

robots.txt won't stop scrapers that ignore it. This is the enforcement layer: WAF rules, bot verification by IP range, IP/ASN blocking, rate limiting, and tarpits — how to actually keep non-compliant AI crawlers out.

By Andrej Ruckij · June 17, 2026 · 5 min read

TL;DR: robots.txt only asks. To actually stop a scraper that ignores it, you enforce at the edge. The enforcement stack, in order: rate limiting for aggressive and stealth crawlers, verification by IP range (since user-agents are spoofable), a WAF to block at the firewall, IP/ASN blocking for crawlers that rotate addresses, and tarpits as a last-resort escalation. Compliant bots belong in robots.txt; this guide is for the ones that don’t comply.

What you’ll learn

  • Why robots.txt can’t enforce, and what does
  • The enforcement stack from cheapest to most aggressive
  • How to verify a bot is real before acting on it
  • When to escalate to IP/ASN blocking or tarpits
  • How managed WAFs bundle most of this

This is the operational counterpart to the conceptual robots-txt-vs-waf-ai-bots and the umbrella access-control guide.

First, the dividing line

There are two populations of AI crawler, and they need completely different tools:

  • Compliant bots (GPTBot, ClaudeBot, PerplexityBot, etc.) identify themselves and, per their operators’ documentation, read robots.txt and obey it. For these, robots.txt is the right tool: set your policy and you’re done (which-ai-bots-to-block). No enforcement needed.
  • Non-compliant scrapers (the Bytespider type, plus stealth crawlers that spoof identity) ignore robots.txt entirely. For these, robots.txt does nothing — and everything below applies.

This whole guide is about the second group. If you only care about the first, you don’t need it (can-robots-txt-stop-ai-scrapers).

Why robots.txt can’t enforce

robots.txt is a voluntary standard — it relies on the bot choosing to read and honor it. A scraper that ignores it never even consults the file. A WAF (web application firewall), by contrast, blocks the request at the edge with a 403 before it reaches your origin, regardless of cooperation — and it overrides robots.txt entirely. That’s the core principle (glossary/waf, full treatment in robots-txt-vs-waf-ai-bots): robots.txt asks; the firewall enforces.
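To make the voluntary part concrete, here is a minimal sketch using Python’s standard urllib.robotparser. The robots.txt content and both function names are hypothetical, chosen only for illustration. The point is that the compliance check runs inside the crawler, so a crawler that skips it meets no resistance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: ask GPTBot to stay out, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_client_may_fetch(user_agent, url):
    """What a compliant crawler does: parse robots.txt and obey the answer."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def rude_client_may_fetch(user_agent, url):
    """What a non-compliant scraper does: it never runs the check at all."""
    return True
```

Nothing on the server side appears in this sketch, and that is the gap the rest of the guide fills.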

The enforcement stack (cheapest to most aggressive)

Work up this ladder; most sites never need the top rungs.

1. Rate limiting (start here)

Cap requests per IP per time window. It targets behavior, not identity, so it catches aggressive and stealth crawlers alike without needing to know who they are — and it rarely harms real users, who don’t make hundreds of requests a minute. Cheapest, broadest, lowest collateral. (block-ai-crawler-ip-asn)
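As a rough illustration of “requests per IP per time window”, here is a small in-memory sliding-window limiter. The class name, the limit, and the window are assumptions for the sketch; in production this usually lives in your CDN, WAF, or reverse proxy rather than in application code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller would answer 429 Too Many Requests
        hits.append(now)
        return True
```

A caller would pass a clock reading such as time.monotonic() as now. Taking the time as a parameter keeps the limiter easy to test and makes no claim about identity: it only counts behavior.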

2. Verification (before you block by name)

Because user-agent strings are spoofable, confirm identity before acting: check the requesting IP against the operator’s published IP-range file, or use forward-confirmed reverse DNS. A request claiming to be GPTBot from outside OpenAI’s ranges is a scraper, not GPTBot. Full method: verify-ai-bots.
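Both checks can be sketched with the standard library. The ranges below are placeholder documentation prefixes, not real operator data; a real setup would load each operator’s published range file. The hostname suffix and the function names are likewise hypothetical.

```python
import ipaddress
import socket

# Placeholder ranges (RFC 5737 documentation blocks), NOT real operator data.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def ip_in_published_ranges(bot_name, ip):
    """True if ip falls inside a range listed for the claimed bot."""
    addr = ipaddress.ip_address(ip)
    cidrs = PUBLISHED_RANGES.get(bot_name, [])
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then resolve the
    hostname forward and confirm it maps back to the same ip (IPv4 only)."""
    try:
        hostname = reverse(ip)[0].lower()
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

For example, ip_in_published_ranges("GPTBot", "203.0.113.5") is False with the placeholder table, so that request would be treated as an impersonator. Keep the leading dot in each suffix (such as ".crawler.example") so that a lookalike domain ending in the same letters does not pass.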

3. WAF rules

Block confirmed-bad traffic at the firewall. This is the actual enforcement step — a managed ruleset or a custom rule returns a 403 at the edge. Pair user-agent rules (for honest bots you want gone) with verified-IP rules (for impersonators).
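The pairing of the two rule types might look like the sketch below. It is written as plain Python rather than any vendor’s rule syntax, and the blocked agent list, the range table, and the names are all illustrative assumptions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ip: str
    user_agent: str


# Illustrative policy only: adjust both tables to your own decisions.
BLOCKED_AGENTS = ("Bytespider",)                 # self-identifying bots you want gone
VERIFIED_RANGES = {"GPTBot": ("192.0.2.0/24",)}  # placeholder range, not real data


def waf_status(req):
    """Return the HTTP status an edge rule set would answer with."""
    ua = req.user_agent.lower()
    # Rule 1: user-agent rule for bots that announce themselves.
    if any(name.lower() in ua for name in BLOCKED_AGENTS):
        return 403
    # Rule 2: verified-IP rule. A known bot name from outside its ranges
    # is treated as an impersonator.
    addr = ipaddress.ip_address(req.ip)
    for name, cidrs in VERIFIED_RANGES.items():
        claims_name = name.lower() in ua
        in_range = any(addr in ipaddress.ip_network(c) for c in cidrs)
        if claims_name and not in_range:
            return 403
    return 200
```

Ordinary browser traffic matches neither rule and passes through, which is the property to preserve when you add rules of your own.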

4. IP / ASN blocking

For scrapers that keep coming back: block the specific IPs if they are stable, or, for crawlers that rotate addresses but run entirely out of one hosting/cloud network, block or challenge the whole ASN (autonomous system number, the identifier of the network that announces those IPs). Powerful but blunt; never hard-block an ASN that also carries real users (prefer a challenge there). Detail: block-ai-crawler-ip-asn.
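The block-versus-challenge distinction can be sketched as a small policy lookup. The prefix-to-ASN table here is hypothetical and uses ASNs reserved for documentation; a real deployment would take this mapping from an ASN database or from the WAF’s own request metadata.

```python
import ipaddress

# Hypothetical data for illustration only.
PREFIX_TO_ASN = {
    "198.51.100.0/24": 64500,  # hosting-only network
    "203.0.113.0/24": 64501,   # mixed network that also carries real users
}
BLOCKED_IPS = {"192.0.2.77"}
ASN_POLICY = {64500: "block", 64501: "challenge"}


def lookup_asn(ip):
    """Return the ASN announcing ip, or None if the table has no match."""
    addr = ipaddress.ip_address(ip)
    for cidr, asn in PREFIX_TO_ASN.items():
        if addr in ipaddress.ip_network(cidr):
            return asn
    return None


def decide(ip):
    """Return 'block', 'challenge', or 'allow' for a requesting IP."""
    if str(ipaddress.ip_address(ip)) in BLOCKED_IPS:
        return "block"
    return ASN_POLICY.get(lookup_asn(ip), "allow")
```

The design choice worth copying is the three-way result: the mixed-traffic network gets a challenge a human can pass, and only the hosting-only network is hard-blocked.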

5. Tarpits (last resort)

Under persistent pressure, escalate from blocking to wasting the crawler’s resources: Nepenthes (infinite maze), Anubis (proof-of-work), or Cloudflare AI Labyrinth (managed decoys). Effective but with real tradeoffs — they can cost your resources and cause collateral damage. Reserve for genuine, sustained scraping. Detail: ai-crawler-tarpits.
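To show why a proof-of-work gate is cheap for you and costly for a crawler, here is a generic hashcash-style sketch. It is not Anubis’s actual protocol; the challenge format, the function names, and the difficulty value are assumptions made for illustration.

```python
import hashlib
import itertools


def leading_zero_bits(digest):
    """Count the zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The asymmetry is the whole mechanism: verify costs one hash per request, while solve costs thousands at a difficulty of 12. That same cost lands on real visitors’ devices too, which is one of the tradeoffs noted above.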

The managed shortcut

Most of this stack is available as a managed service. Cloudflare and similar CDNs bundle bot scoring, verified-bot lists, ASN intelligence, rate rules, and even a tarpit (AI Labyrinth), and since July 2025 Cloudflare has blocked AI crawlers by default for newly onboarded domains (does-cloudflare-block-ai-crawlers). For most teams, enabling and tuning a managed AI-bot ruleset is more effective than hand-rolling firewall rules. Just remember it operates at the edge, so it overrides your robots.txt and must be reconciled with it (ai-crawler-access-checklist step 4).
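One way to reconcile the two layers is a quick consistency check like the sketch below. It assumes you can export the list of bot names your edge ruleset blocks; the function name, the probe URL, and the sample inputs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_conflicts(robots_txt, edge_blocked, bots,
                   probe_url="https://example.com/"):
    """Return the bots that robots.txt allows but the edge blocks.

    The edge wins in practice, so each name returned is a bot your
    robots.txt invites in and your firewall then turns away.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for bot in bots:
        robots_allows = parser.can_fetch(bot, probe_url)
        if robots_allows and bot in edge_blocked:
            conflicts.append(bot)
    return conflicts
```

Running it with a robots.txt that disallows only GPTBot and an edge blocklist of GPTBot and ClaudeBot would flag ClaudeBot: the file says it is welcome, while the firewall returns a 403.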

A realistic expectation

Enforcement makes crawling your site expensive and slow, not impossible. A determined scraper on a large residential proxy pool is genuinely hard to fully stop; the goal is to push its cost above the value it extracts. And none of this should touch the bots you want: verify and allow the search/retrieval crawlers that cite you, or you’ll enforce your way out of AI visibility (what-you-lose-blocking-ai-search-bots).

Key takeaways

  • robots.txt handles compliant bots; this enforcement stack handles the ones that ignore it.
  • robots.txt asks, a WAF enforces — the firewall blocks at the edge and overrides robots.txt.
  • Climb the ladder: rate-limit → verify → WAF-block → IP/ASN → tarpit. Most sites stop at the middle rungs.
  • Verify by IP range before blocking by name; never hard-block mixed-traffic ASNs.
  • Managed WAFs bundle most of it; the goal is to make scraping uneconomical, not impossible — and never block the bots that cite you.

Sources