AI crawlers mass-scraped my site until it would no longer load (solutions included)
Yesterday I ran into the same situation as @简丶: crawlers from AI tools were scraping my site so aggressively that pages hung on loading and never came up. (Only when I read the logs did I see the scraping had been going on for a long time; it just hadn't knocked the site over before, so I never noticed.) While searching for a fix I found that thread and followed its steps, and I also had Doubao AI analyze the problem and combined its suggestions with the thread's approach. Sharing the result here.
First, a screenshot of the logs:
Solution 1: filter the AI crawlers by User-Agent with the free Nginx firewall plugin in the 宝塔 (BT) panel, following the reference @雨天榕树 shared in the comments: https://www.52txr.cn/2025/banaicurl.html
The list below includes extra crawlers I added for my own situation:
```
(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)
```
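If you are not on the 宝塔 panel or its firewall plugin, a minimal sketch of the same filter in plain Nginx (my assumption of an equivalent, not from the original thread) would sit inside the site's server { } block:

```nginx
# Sketch: return 403 to the same AI-crawler User-Agents without the firewall plugin.
# The list mirrors the filter above; extend it as needed.
if ($http_user_agent ~* "(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)") {
    return 403;
}
```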
Next I used robots.txt to ban the AI crawlers and throttle Baidu's crawl rate, only to find the site still wouldn't load:
```
# Baiduspider: allowed to crawl, but asked to slow down
User-agent: Baiduspider
Crawl-delay: 5

# AI crawlers and special tools: the whole site is off limits
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: aiHitBot
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: AwarioBot
User-agent: Brightbot 1.0
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Cotoyogi
User-agent: Crawlspace
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: Factset_spyderbot
User-agent: FirecrawlAgent
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: imgproxy
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: NovaAct
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: Operator
User-agent: PanguBot
User-agent: Perplexity-User
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: SemrushBot-BA
User-agent: SemrushBot-FT
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SI
User-agent: SemrushBot-SWA
User-agent: Sidetrade indexer bot
User-agent: SplitSignalBot
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot
Disallow: /
```
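Keep in mind that robots.txt is purely advisory: compliance is voluntary, several major crawlers ignore Crawl-delay entirely, and the misbehaving AI bots are exactly the ones most likely to skip the file altogether, which is consistent with the site still failing to load after this step. Hence the Nginx-level measures below.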
Solution 2 (used together with Solution 1), the plan Doubao AI gave: add the following to the panel's global Nginx config file, inside the http { } block:
```nginx
# 1. Map Baiduspider's User-Agent to a rate-limit key (this must sit inside the http { } block).
#    Requests from anything else get an empty key.
map $http_user_agent $baidu_limit_key {
    default         "";
    "~*Baiduspider" $binary_remote_addr;   # matches Baiduspider's User-Agent
}

# 2. Define the rate-limit zone for Baiduspider.
#    rate=100r/m: at most 100 requests per minute per crawler IP (tune to your server's capacity).
limit_req_zone $baidu_limit_key zone=baidu_spider:10m rate=100r/m;
```
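Keying the zone on a value that is empty for non-Baidu requests is what keeps ordinary visitors unthrottled: Nginx does not account requests whose rate-limit key is empty, so only Baiduspider traffic, keyed per source IP, counts against the 100 r/m budget.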
Then add this to the site's own Nginx config, inside the server { } block. The thumbnail paths are what my site uses, so adapt them to your situation:

```nginx
# ------------------------ Thumbnail optimization (full-path match) ------------------------
# Matches every image file under /_data/i/upload/ (including dated subdirectories such as /2024/08/08/)
location ~* ^/_data/i/upload/.*\.(jpg|jpeg|png|webp|avif|heic|heif)$ {
    # Strong cache for one year (CDN and browsers; sets both Cache-Control and Expires for older clients)
    expires 1y;
    add_header Cache-Control "public";
    # No access logging for thumbnails (less disk IO)
    access_log off;
    # The global hotlink-protection rules already reject bad Referers; no need to repeat them here
}

# ------------------------ AI crawlers and original-image protection ------------------------
# Flag the User-Agents to block (AI crawlers plus malicious tools)
set $block_ua 0;
if ($http_user_agent ~* "(HTTrack|Apache-HttpClient|harvest|audit|dirbuster|pangolin|nmap|sqln|hydra|Parser|libwww|BBBike|sqlmap|w3af|owasp|Nikto|fimap|havij|zmeu|BabyKrokodil|netsparker|httperf|SF|AI2Bot|Ai2Bot-Dolma|aiHitBot|ChatGPT-User|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|Scrapy|TikTokSpider|VelenPublicWebCrawler)") {
    set $block_ua 1;
}
# Let legitimate search engines back through (Baidu, Google, etc.)
if ($http_user_agent ~* "(Baiduspider|Googlebot|bingbot|YandexBot|Sogou web spider|Bytespider)") {
    set $block_ua 0;
}

# Harden the original-image directory (/upload/): only flagged UAs are blocked, normal users are unaffected
location ~* ^/upload/ {
    if ($block_ua = 1) {
        return 403;
    }
    try_files $uri $uri/ =404;
}

# ------------------------ Rate-limit dynamic pages (only Baiduspider is affected) ------------------------
location ~* ^/(picture\.php|index\.php) {
    # The limit only bites when $baidu_limit_key is non-empty, i.e. for Baiduspider
    limit_req zone=baidu_spider burst=20 nodelay;
    # Existing PHP handling (e.g. include enable-php-84.conf)
    include enable-php-84.conf;
}

# ------------------------ Other configuration follows ------------------------
```
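Before reloading, it's worth validating the combined config (for example with `nginx -t`, or the panel's own save-time check) so a typo doesn't take the site down along with the crawlers; the changes only take effect once Nginx reloads.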
Solution 3: blacklist the IP ranges of the search engines and AI spiders outright (this will stop your content from being indexed). Once the site has recovered, try lifting the ban.
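The post doesn't include an actual ban list, so here is a minimal sketch of what Solution 3 can look like in the site's server { } block; the CIDR ranges below are placeholders from the documentation address space, so substitute the ranges you actually see flooding your access log:

```nginx
# Sketch: temporary IP-range bans for crawler traffic (Solution 3).
# 192.0.2.0/24 and 198.51.100.0/24 are placeholders; replace them with the
# offending ranges from your own access log, and remove these lines once
# the site has recovered, or search engines will stop indexing you.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
allow all;
```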