AI Crawlers Rate Limiting & Honeypots

Introduction

Working across different projects and clients, I've noticed that protecting a website from unwanted AI crawlers is becoming increasingly important. Today I want to share a few practical tips that helped me manage server resources and cut unnecessary crawling, including a "honeypot" technique to permanently block the worst offenders.

In The AI Crawler Era: Time to Step Up Your Server Game

Here's the reality—AI crawlers are not going away. OpenAI, Anthropic, Perplexity, Apple, Meta, ByteDance, and countless others are all crawling the web to train their models. And they're crawling aggressively. Your server infrastructure needs to adapt.

Why This Matters Now:

Unlike traditional search engine bots that crawl politely and occasionally, AI crawlers are relentless. They hit high-traffic sites constantly, consuming bandwidth and CPU cycles. If you're not prepared, your bills go up, your performance tanks, and your legitimate users suffer. The old approach of "just let everyone in" doesn't cut it anymore.

The future of the web is going to be shaped by whoever controls their server resources best. Make sure that's you.

Here are some practical examples and tips to help you take back control.

Step 1 - nginx for Rate Limiting

When it comes to controlling AI crawlers, nginx is hands down the most efficient approach. nginx applies rate limiting at the server edge before requests even touch your application or database.

Instead of using if statements (which can behave unpredictably in nginx), the best practice is to build the rate-limit key with map. limit_req_zone does not count requests whose key is an empty string, so mapping a request to "" exempts it from that zone entirely.

Complete example for /etc/nginx/nginx.conf:

http {
    # Basic nginx settings
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Detect crawlers and map their IP to the crawler zone
    map $http_user_agent $crawler_ip {
        default ""; # Empty string means no limit applied in this zone
        ~*(bot|crawl|spider|gpt|claude|anthropic|perplexity|applebot|bytespider) $binary_remote_addr;
    }

    # Map normal users to the normal zone
    map $http_user_agent $normal_ip {
        default $binary_remote_addr;
        ~*(bot|crawl|spider|gpt|claude|anthropic|perplexity|applebot|bytespider) "";
    }

    # Define the rate limit zones
    limit_req_zone $crawler_ip zone=crawler_limit:10m rate=30r/m;
    limit_req_zone $normal_ip zone=normal_limit:10m rate=100r/s;

    # Include site configurations
    include /etc/nginx/sites-enabled/*;
}

Example for your site file at /etc/nginx/sites-available/your-site.conf:

server {
    listen 80;
    server_name example.com;

    # Apply the mapped rate limits
    location / {
        limit_req zone=crawler_limit burst=5 nodelay;
        limit_req zone=normal_limit burst=20 nodelay;

        # Your other location directives
        try_files $uri $uri/ =404;
    }
}

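This approach limits identified crawlers to 30 requests per minute per IP address, with a burst allowance of 5. Regular users get a much higher limit of 100 requests per second. Crawlers can still access your site, but at a controlled pace that won't hammer your infrastructure. Requests over the limit are rejected with a 503 by default; add limit_req_status 429; if you would rather send "Too Many Requests". Keep in mind that the broad pattern (bot, crawl, spider) also matches search engine crawlers such as Googlebot and bingbot, so narrow it if you don't want to throttle them.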

Pro Tip: Verify Real Crawlers
Some malicious bots spoof their user-agents to pretend to be OpenAI or Claude. You can validate the IP address against official lists (like https://openai.com/gptbot.json) to separate the real deal from the imposters.
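
As a rough illustration of that check, here is a minimal Node.js sketch that tests a client IP against a list of CIDR ranges using the built-in net.BlockList. The JSON shape (a prefixes array with ipv4Prefix entries) is an assumption about how such range files are laid out, so check it against the real file before relying on it. The ranges below are documentation-only addresses, and buildRangeList and isVerifiedCrawler are hypothetical helper names.

```javascript
const net = require('net');

// Hypothetical sample; the real file's structure and ranges may differ.
const publishedRanges = {
    prefixes: [
        { ipv4Prefix: '192.0.2.0/24' },
        { ipv4Prefix: '198.51.100.0/28' }
    ]
};

// Load every published CIDR range into a BlockList (used here as an allow list)
function buildRangeList(rangeFile) {
    const list = new net.BlockList();
    for (const entry of rangeFile.prefixes) {
        const [network, bits] = entry.ipv4Prefix.split('/');
        list.addSubnet(network, Number(bits), 'ipv4');
    }
    return list;
}

// True only if the address is valid IPv4 and falls inside a published range
function isVerifiedCrawler(list, ip) {
    return net.isIPv4(ip) && list.check(ip, 'ipv4');
}

const rangeList = buildRangeList(publishedRanges);
const genuine = isVerifiedCrawler(rangeList, '192.0.2.44');   // inside 192.0.2.0/24
const spoofed = isVerifiedCrawler(rangeList, '203.0.113.5');  // not in any range
console.log(genuine, spoofed); // true false
```

A request that claims a crawler user-agent but fails this check is a good candidate for the stricter treatment described in Step 2.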

Remember to reload nginx after the changes are made:

sudo nginx -t
sudo systemctl reload nginx

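Check if the limiter works. The first loop (crawler user-agent) should start returning 503 after the first few requests, while the second loop (browser user-agent) should not be rate limited: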

for i in {1..50}; do curl -A "GPTBot" -o /dev/null -s -w "%{http_code}\n" http://<your_domain>; done
for i in {1..50}; do curl -A "Mozilla/5.0" -o /dev/null -s -w "%{http_code}\n" http://<your_domain>; done
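
To see why the crawler loop starts failing after a handful of requests, here is a small JavaScript sketch of the leaky-bucket accounting behind rate=30r/m burst=5 nodelay. It is a simplified model for intuition, not nginx's actual code, and createLimiter and allow are hypothetical names.

```javascript
// Simplified model of limit_req accounting (leaky bucket with nodelay).
// createLimiter/allow are illustrative names, not part of nginx.
function createLimiter(ratePerSecond, burst) {
    let excess = 0;   // requests currently above the steady rate
    let last = null;  // time (ms) of the last accepted request

    return function allow(nowMs) {
        if (last === null) {            // first request seen for this key
            last = nowMs;
            return true;
        }
        const elapsedSec = (nowMs - last) / 1000;
        // Drain the bucket for the elapsed time, then add this request
        const next = Math.max(0, excess - elapsedSec * ratePerSecond + 1);
        if (next > burst) return false; // this is where nginx would reject
        excess = next;
        last = nowMs;
        return true;
    };
}

// 30r/m is 0.5 requests per second
const allow = createLimiter(0.5, 5);

// 50 requests arriving at the same instant, like a very tight loop
let accepted = 0;
for (let i = 0; i < 50; i++) {
    if (allow(0)) accepted++;
}
console.log(accepted); // 6 (the first request plus a burst of 5)

// Two seconds later one slot has drained, so one more request fits
const acceptedLater = allow(2000);
console.log(acceptedLater); // true
```

A real curl loop spreads its requests over time, so a few more than 6 may get through, but the pattern is the same: a short burst, then roughly one request every two seconds.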

Step 2 - The Honeypot Trap (Special for IP Blocking)

Rate limiting is great for polite bots, but what about aggressive, rogue scrapers that fake their User-Agent and ignore your limits? You trap them.

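A "honeypot" is a trap designed specifically for automated bots. Human visitors only see the rendered page, but crawlers read the raw HTML. If you hide a link in your HTML, humans will never click it, but greedy scrapers will follow it. Also list the trap path under Disallow in robots.txt, so that well-behaved crawlers that honor it stay out and only the ones ignoring your rules get caught.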

Add the Trap in Your HTML

Inject an invisible link into your website's header, footer, or body (e.g. in /var/www/html/your-site.html).
```
<a href="/api/hidden-trap-door" style="display: none; visibility: hidden;" rel="nofollow">Do Not Click</a>
```

Nginx Auto-Ban Configuration

Now, configure your server to drop the connection instantly if any IP visits that specific trap URL, and log their IP so you can permanently ban them (e.g. in /etc/nginx/sites-available/your-site.conf).

# Add this inside your server { ... } block
location = /api/hidden-trap-door {
    # Log this specifically so Fail2Ban can read it
    access_log /var/log/nginx/honeypot.log;

    # 444 closes the connection instantly without returning a response
    return 444;
}

Remember to reload nginx after the changes are made:

sudo nginx -t
sudo systemctl reload nginx

Check if the trap works

# This should return something like 'Empty reply from server'
curl -v http://<your_domain>/api/hidden-trap-door

# Check the log entry on your server
sudo tail -f /var/log/nginx/honeypot.log

Permanent Blocking (Fail2Ban)

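You can point a tool like Fail2Ban at /var/log/nginx/honeypot.log. It needs a filter whose failregex matches the lines in that file, plus a jail that references the filter and the log path. Set maxretry = 1 so a single hit triggers a ban, and bantime = -1 to make the ban permanent. When an IP address shows up in the file, Fail2Ban updates your server's firewall (iptables, unless you configure another ban action) to block all future traffic from that IP. The bot is banished from your server at the operating system level!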
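
To make the matching step concrete, here is a minimal JavaScript sketch of what such a filter does conceptually: pull the client IP out of each honeypot log line. Fail2Ban does this with its own filter regex, so this is only an illustration. It assumes nginx's default combined log format, the sample lines are made up, and extractTrappedIps is a hypothetical name.

```javascript
// Collect the unique client IPs that requested the trap URL.
// Assumes the combined log format: IP - user [time] "METHOD path protocol" ...
function extractTrappedIps(logText, trapPath = '/api/hidden-trap-door') {
    const ips = new Set();
    for (const line of logText.split('\n')) {
        const match = line.match(/^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)/);
        if (!match) continue;               // skip blank or malformed lines
        const path = match[2].split('?')[0]; // ignore any query string
        if (path === trapPath) ips.add(match[1]);
    }
    return [...ips];
}

// Made-up sample lines in the shape of /var/log/nginx/honeypot.log
const sampleLog = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /api/hidden-trap-door HTTP/1.1" 444 0 "-" "python-requests/2.31.0"',
    '198.51.100.23 - - [10/Oct/2024:13:56:02 +0000] "GET /api/hidden-trap-door?x=1 HTTP/1.1" 444 0 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2024:13:57:10 +0000] "GET /api/hidden-trap-door HTTP/1.1" 444 0 "-" "python-requests/2.31.0"'
].join('\n');

const trapped = extractTrappedIps(sampleLog);
console.log(trapped); // [ '203.0.113.7', '198.51.100.23' ]
```

Note that the second sample line uses an ordinary browser user-agent, which is the point of the trap: it catches scrapers by behavior rather than by what they claim to be.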

Step 3 - .htaccess Works, But It's Not Optimal

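If you're on Apache, you can block crawlers in .htaccess. It works, but Apache still runs each request through its processing pipeline and looks up .htaccess files on every request. It's better than letting the request reach your app, but it's not as efficient as doing it in nginx.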

Example for /var/www/html/.htaccess:

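# Apache bot blocking by User-Agent (this blocks outright; it does not rate limit)
# SetEnvIfNoCase matches the User-Agent case-insensitively
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "anthropic-ai" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
# Note: facebookexternalhit fetches link previews; blocking it removes previews on Facebook
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot

# Block the identified bots
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>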

It gets the job done, but you're still consuming more resources than with nginx.

Step 4 - Middleware For Custom Logic

If you want fine-grained control and you're willing to use some application resources, middleware is your tool.

const express = require('express');
const app = express();

// Lowercase tokens so the match is case-insensitive
const blockedBots = ['gptbot', 'anthropic-ai', 'ccbot', 'bytespider'];

app.use((req, res, next) => {
    const userAgent = (req.get('user-agent') || '').toLowerCase();

    // Note: this rejects every request from a listed bot; it does not count requests
    if (blockedBots.some(bot => userAgent.includes(bot))) {
        res.set({
            'Retry-After': '60',
            'X-RateLimit-Limit': '30',
            'X-RateLimit-Remaining': '0',
            // Unix timestamp in seconds, a common convention for this non-standard header
            'X-RateLimit-Reset': String(Math.ceil((Date.now() + 60000) / 1000))
        });

        return res.status(429).json({
            error: 'Rate limit exceeded',
            message: 'AI crawler requests are rate limited. Please slow down.'
        });
    }
    next();
});

// Your routes...

It's elegant and flexible, but remember—this runs for every request. Use it when you need application-level context.
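
If you want the middleware to actually count requests instead of rejecting every listed bot, a fixed-window counter per IP is one simple option. The following is a minimal sketch under stated assumptions: crawlerRateLimit and its options are hypothetical names, the counters live in process memory (so they reset on restart, are not shared between instances, and are never pruned), and req.ip only reflects the real client if Express's proxy settings match your setup.

```javascript
// Fixed-window rate limiter for requests whose User-Agent matches `pattern`.
// Illustrative sketch; names and defaults are assumptions, not a library API.
function crawlerRateLimit(options = {}) {
    const {
        limit = 30,
        windowMs = 60000,
        pattern = /gptbot|anthropic-ai|ccbot|bytespider/i,
        now = Date.now          // injectable clock, handy for testing
    } = options;
    const windows = new Map();  // ip -> { start, count }

    return function (req, res, next) {
        const userAgent = req.get('user-agent') || '';
        if (!pattern.test(userAgent)) return next(); // not a listed crawler

        const t = now();
        let w = windows.get(req.ip);
        if (!w || t - w.start >= windowMs) {         // start a fresh window
            w = { start: t, count: 0 };
            windows.set(req.ip, w);
        }
        w.count += 1;
        if (w.count <= limit) return next();

        // Over the limit: tell the client when the current window ends
        const retryAfter = Math.ceil((w.start + windowMs - t) / 1000);
        res.set('Retry-After', String(retryAfter));
        return res.status(429).json({ error: 'Rate limit exceeded' });
    };
}

// In an Express app: app.use(crawlerRateLimit({ limit: 30 }));

// Quick check with stand-in req/res objects (no Express needed)
const limiter = crawlerRateLimit({ limit: 2, now: () => 1000 });
const fakeReq = { ip: '203.0.113.9', get: () => 'Mozilla/5.0 (compatible; GPTBot/1.0)' };
const statuses = [];
const fakeRes = {
    set() { return this; },
    status(code) { statuses.push(code); return this; },
    json() { return this; }
};
let passed = 0;
for (let i = 0; i < 3; i++) limiter(fakeReq, fakeRes, () => { passed++; });
console.log(passed, statuses); // 2 [ 429 ]
```

For production traffic, the nginx limits from Step 1 remain the cheaper place to do this; the application-level version is mainly useful when the limit depends on something only the app knows.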

Step 5 - robots.txt is Just a Polite Suggestion

Here's something a lot of people don't realize: /var/www/html/robots.txt is basically an honor system. Well-behaved crawlers follow it, but rogue AI scrapers do not.

# Allow GPTBot everywhere, but ask it to slow down
# (Crawl-delay is non-standard and many crawlers ignore it)
User-agent: GPTBot
Allow: /
Crawl-delay: 10

# Block other AI crawlers entirely
User-agent: CCBot
Disallow: /

User-agent: *
Crawl-delay: 5

Use robots.txt as your very first line of defense to ask nicely, but don't rely on it as security.

Step 6 - Combine Your Defenses

The real power comes from layering your defenses. Use a Defense in Depth strategy:

robots.txt catches the polite bots and sets the rules.
Nginx Rate Limits slow down the aggressive bots that ignore crawl delays.
The Honeypot catches the malicious scrapers faking their user-agents and permanently bans their IPs.
Middleware gives you custom logic when you need it. Optional, but useful for specific cases.

Step 7 - Monitor Your Access Logs

Before you start blocking blindly, actually see who's crawling you. You can grep your logs to see the biggest offenders:

sudo grep -i "GPTBot\|anthropic-ai\|bytespider" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10

This command not only finds the bots but also lists the top 10 IP addresses hitting your server with those user-agents. It assumes the default log format, where the client IP is the first field. Knowing is half the battle.
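
If you would rather do the same tally from a script, for example to feed a dashboard, the pipeline above can be sketched in a few lines of Node.js. This is a rough equivalent, not a replacement for the shell command. It makes the same assumption that the client IP is the first space-separated field, the sample lines are made up, and topCrawlerIps is a hypothetical name.

```javascript
// Count requests per IP for lines matching the crawler pattern,
// then return the busiest IPs as [ip, count] pairs, highest first.
function topCrawlerIps(logText, pattern = /gptbot|anthropic-ai|bytespider/i, topN = 10) {
    const counts = new Map();
    for (const line of logText.split('\n')) {
        if (!pattern.test(line)) continue;  // like grep -i
        const ip = line.split(' ')[0];      // like awk '{print $1}'
        counts.set(ip, (counts.get(ip) || 0) + 1);
    }
    return [...counts.entries()]
        .sort((a, b) => b[1] - a[1])        // like sort -nr
        .slice(0, topN);                    // like head
}

// Made-up sample; in practice read the file, e.g.
// require('fs').readFileSync('/var/log/nginx/access.log', 'utf8')
const accessLog = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '198.51.100.23 - - [10/Oct/2024:13:55:37 +0000] "GET /a HTTP/1.1" 200 512 "-" "Bytespider"',
    '192.0.2.1 - - [10/Oct/2024:13:55:38 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2024:13:55:39 +0000] "GET /c HTTP/1.1" 200 512 "-" "GPTBot/1.0"'
].join('\n');

const offenders = topCrawlerIps(accessLog);
console.log(offenders); // [ [ '203.0.113.7', 2 ], [ '198.51.100.23', 1 ] ]
```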

Conclusion

If you found this helpful, share it with your dev friends and colleagues. Protecting your server resources from unwanted crawlers is just as important as optimizing your code. And if you have any questions or tips of your own, feel free to drop a comment or reach out.

riiiaddesign@gmail.com

Happy blocking! 🚀
