AI crawlers mass-scraped my site until it would no longer load (solutions included)
Yesterday I ran into the same situation as @简丶: crawlers from AI tools were scraping my site so aggressively that pages hung on loading and never came up. (Only when I read the logs did I see the scraping had been going on for a long time; it just hadn't knocked the site over before, so I never noticed.) While searching for a fix I found that thread and followed its steps, and I also had Doubao AI analyze the problem and combined its suggestions with the thread's approach. Sharing the result here.
First, a screenshot of the logs:
Solution 1: filter the AI crawlers by User-Agent with the free Nginx firewall plugin in the 宝塔 (BT) panel, following the reference @雨天榕树 shared in the comments: https://www.52txr.cn/2025/banaicurl.html
The list below includes extra crawlers I added for my own situation:
```
(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)
```
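If you are not on the 宝塔 panel or its firewall plugin, a minimal sketch of the same filter in plain Nginx (my assumption of an equivalent, not from the original thread) would sit inside the site's server { } block:

```nginx
# Sketch: return 403 to the same AI-crawler User-Agents without the firewall plugin.
# The list mirrors the filter above; extend it as needed.
if ($http_user_agent ~* "(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)") {
    return 403;
}
```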
Next I used robots.txt to ban the AI crawlers and throttle Baidu's crawl rate, only to find the site still wouldn't load:
```
# Baiduspider: allowed to crawl, but asked to slow down
User-agent: Baiduspider
Crawl-delay: 5

# AI crawlers and special tools: the whole site is off limits
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: aiHitBot
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: AwarioBot
User-agent: Brightbot 1.0
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Cotoyogi
User-agent: Crawlspace
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: Factset_spyderbot
User-agent: FirecrawlAgent
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: imgproxy
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: NovaAct
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: Operator
User-agent: PanguBot
User-agent: Perplexity-User
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: SemrushBot-BA
User-agent: SemrushBot-FT
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SI
User-agent: SemrushBot-SWA
User-agent: Sidetrade indexer bot
User-agent: SplitSignalBot
User-agent: TikTokSpider
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot
Disallow: /
```
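Keep in mind that robots.txt is purely advisory: compliance is voluntary, several major crawlers ignore Crawl-delay entirely, and the misbehaving AI bots are exactly the ones most likely to skip the file altogether, which is consistent with the site still failing to load after this step. Hence the Nginx-level measures below.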
Solution 2 (used together with Solution 1), the plan Doubao AI gave: add the following to the panel's global Nginx config file, inside the http { } block:
```nginx
# 1. Map Baiduspider's User-Agent to a rate-limit key (this must sit inside the http { } block).
#    Requests from anything else get an empty key.
map $http_user_agent $baidu_limit_key {
    default         "";
    "~*Baiduspider" $binary_remote_addr;   # matches Baiduspider's User-Agent
}

# 2. Define the rate-limit zone for Baiduspider.
#    rate=100r/m: at most 100 requests per minute per crawler IP (tune to your server's capacity).
limit_req_zone $baidu_limit_key zone=baidu_spider:10m rate=100r/m;
```
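Keying the zone on a value that is empty for non-Baidu requests is what keeps ordinary visitors unthrottled: Nginx does not account requests whose rate-limit key is empty, so only Baiduspider traffic, keyed per source IP, counts against the 100 r/m budget.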
Then add this to the site's own Nginx config, inside the server { } block. The thumbnail paths are what my site uses, so adapt them to your situation:

```nginx
# ------------------------ Thumbnail optimization (full-path match) ------------------------
# Matches every image file under /_data/i/upload/ (including dated subdirectories such as /2024/08/08/)
location ~* ^/_data/i/upload/.*\.(jpg|jpeg|png|webp|avif|heic|heif)$ {
    # Strong cache for one year (CDN and browsers; sets both Cache-Control and Expires for older clients)
    expires 1y;
    add_header Cache-Control "public";
    # No access logging for thumbnails (less disk IO)
    access_log off;
    # The global hotlink-protection rules already reject bad Referers; no need to repeat them here
}

# ------------------------ AI crawlers and original-image protection ------------------------
# Flag the User-Agents to block (AI crawlers plus malicious tools)
set $block_ua 0;
if ($http_user_agent ~* "(HTTrack|Apache-HttpClient|harvest|audit|dirbuster|pangolin|nmap|sqln|hydra|Parser|libwww|BBBike|sqlmap|w3af|owasp|Nikto|fimap|havij|zmeu|BabyKrokodil|netsparker|httperf|SF|AI2Bot|Ai2Bot-Dolma|aiHitBot|ChatGPT-User|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|Scrapy|TikTokSpider|VelenPublicWebCrawler)") {
    set $block_ua 1;
}
# Let legitimate search engines back through (Baidu, Google, etc.)
if ($http_user_agent ~* "(Baiduspider|Googlebot|bingbot|YandexBot|Sogou web spider|Bytespider)") {
    set $block_ua 0;
}

# Harden the original-image directory (/upload/): only flagged UAs are blocked, normal users are unaffected
location ~* ^/upload/ {
    if ($block_ua = 1) {
        return 403;
    }
    try_files $uri $uri/ =404;
}

# ------------------------ Rate-limit dynamic pages (only Baiduspider is affected) ------------------------
location ~* ^/(picture\.php|index\.php) {
    # The limit only bites when $baidu_limit_key is non-empty, i.e. for Baiduspider
    limit_req zone=baidu_spider burst=20 nodelay;
    # Existing PHP handling (e.g. include enable-php-84.conf)
    include enable-php-84.conf;
}

# ------------------------ Other configuration follows ------------------------
```
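Before reloading, it's worth validating the combined config (for example with `nginx -t`, or the panel's own save-time check) so a typo doesn't take the site down along with the crawlers; the changes only take effect once Nginx reloads.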
Solution 3: blacklist the IP ranges of the search engines and AI spiders outright (this will stop your content from being indexed). Once the site has recovered, try lifting the ban.
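The post doesn't include an actual ban list, so here is a minimal sketch of what Solution 3 can look like in the site's server { } block; the CIDR ranges below are placeholders from the documentation address space, so substitute the ranges you actually see flooding your access log:

```nginx
# Sketch: temporary IP-range bans for crawler traffic (Solution 3).
# 192.0.2.0/24 and 198.51.100.0/24 are placeholders; replace them with the
# offending ranges from your own access log, and remove these lines once
# the site has recovered, or search engines will stop indexing you.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
allow all;
```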