C# 检测诚实的网络爬虫
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/544450/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Detecting honest web crawlers
提问 by JavadocMD
I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
我想(在服务器端)检测哪些请求来自机器人。在这个阶段我并不关心恶意机器人,只关心那些遵守规则的机器人。我见过一些方法,主要是把用户代理字符串与 'bot' 之类的关键字进行匹配,但这种做法显得笨拙、不完整而且难以维护。那么,有没有人有更可靠的方法?如果没有,你们有没有什么资源可以用来及时了解所有友好爬虫的用户代理?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However if a web crawler is detected, we'd always give them the same version so that the index is consistent.
如果你好奇的话:我并不是想做任何违反搜索引擎政策的事情。我们网站上有一个部分,会随机向用户展示某个页面的几个略有差异的版本之一。但如果检测到是网络爬虫,我们会始终提供同一个版本,以保证索引的一致性。
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
另外,我用的是 Java,但我想这种方法对任何服务器端技术应该都是类似的。
采纳答案 by Sparr
You can find a very thorough database of data on known "good" web crawlers in the robotstxt.org Robots Database. Utilizing this data would be far more effective than just matching 'bot' in the user-agent.
您可以在 robotstxt.org 的 Robots Database(机器人数据库)中找到一个关于已知“好”网络爬虫的非常详尽的数据库。利用这些数据远比仅仅在用户代理中匹配 'bot' 有效得多。
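A minimal C# sketch of how such a list could be used, assuming you have exported the user-agent strings from the Robots Database into a local text file with one entry per line (the file name and format here are illustrative assumptions, not part of the original answer):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class RobotsDatabase
{
    // Loaded once at startup so the file is not re-read on every request.
    private static readonly List<string> KnownAgents =
        File.ReadAllLines("known-robots.txt")
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Trim().ToLowerInvariant())
            .ToList();

    public static bool IsKnownCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLowerInvariant();
        return KnownAgents.Any(agent => ua.Contains(agent));
    }
}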
回答 by Sparr
Any visitor whose entry page is /robots.txt is probably a bot.
任何以 /robots.txt 作为入口页面的访问者很可能是机器人。
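A rough C# sketch of that heuristic: remember which client IPs have fetched /robots.txt and treat their later requests as bot traffic. The dictionary, its lack of expiry, and the method names are illustrative assumptions, not something from the answer itself.

using System;
using System.Collections.Concurrent;

public static class RobotsTxtTracker
{
    // IPs that have been seen fetching /robots.txt (sketch only: no expiry, no persistence).
    private static readonly ConcurrentDictionary<string, DateTime> LikelyBots =
        new ConcurrentDictionary<string, DateTime>();

    // Call once per incoming request, e.g. from an IHttpModule or middleware.
    public static bool IsLikelyBot(string clientIp, string path)
    {
        if (string.Equals(path, "/robots.txt", StringComparison.OrdinalIgnoreCase))
            LikelyBots.TryAdd(clientIp, DateTime.UtcNow);
        return LikelyBots.ContainsKey(clientIp);
    }
}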
回答 by Dscoduc
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users wouldn't see the link, leaving spiders and bots to follow. For example, an empty anchor tag that points to a subfolder would record a get request in your logs...
一个建议是在页面上放置一个只有机器人才会跟随的空锚点。普通用户看不到这个链接,只有蜘蛛和机器人会去访问它。例如,一个指向某个子目录的空锚标记会在你的日志中记录一条 GET 请求……
<a href="dontfollowme.aspx"></a>
Many people use this method while running a HoneyPot to catch malicious bots that aren't following the robots.txt file. I use the empty anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
许多人在运行蜜罐(HoneyPot)时使用这种方法来捕获不遵守 robots.txt 文件的恶意机器人。我在自己编写的一个 ASP.NET 蜜罐解决方案中就使用了空锚方法,来捕获并阻止那些讨厌的爬虫……
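The linked project is the author's own; as a hedged illustration only (not that solution's actual code), a WebForms code-behind for the dontfollowme.aspx honeypot page could simply log whoever requests it:

using System;
using System.IO;
using System.Web.UI;

// Code-behind sketch for dontfollowme.aspx; logs to a flat file for brevity.
public partial class DontFollowMe : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string line = string.Format("{0:u}\t{1}\t{2}",
            DateTime.UtcNow,
            Request.UserHostAddress,
            Request.UserAgent ?? "(no user-agent)");
        File.AppendAllText(Server.MapPath("~/App_Data/honeypot.log"),
            line + Environment.NewLine);
    }
}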
回答 by Brian Armstrong
Something quick and dirty like this might be a good start:
像这样简单粗暴的做法可能是一个不错的起点:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: rails code, but regex is generally applicable.
注意:这是 Rails 代码,但正则表达式是通用的。
回答 by Stewart McKee
I'm pretty sure a large proportion of bots don't use robots.txt, however that was my first thought.
我很确定相当一部分机器人并不使用 robots.txt,不过这确实是我最先想到的办法。
It seems to me that the best way to detect a bot is by the time between requests; if the time between requests is consistently fast then it's a bot.
在我看来,检测机器人的最佳方法是看请求之间的时间间隔:如果请求间隔始终很短,那它就是机器人。
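As a sketch of that idea (with the caveat that shared proxies and NAT make per-IP timing noisy), one could track the interval between requests per client and flag anyone who keeps requesting faster than a threshold; the threshold and streak length below are guesses, not values from the answer:

using System;
using System.Collections.Concurrent;

public static class RequestRateHeuristic
{
    private class ClientStats { public DateTime LastSeen; public int FastStreak; }

    private static readonly ConcurrentDictionary<string, ClientStats> Clients =
        new ConcurrentDictionary<string, ClientStats>();

    // Flags a client after several consecutive sub-second gaps between requests.
    public static bool LooksLikeBot(string clientIp)
    {
        DateTime now = DateTime.UtcNow;
        ClientStats stats = Clients.GetOrAdd(clientIp,
            _ => new ClientStats { LastSeen = DateTime.MinValue, FastStreak = 0 });
        lock (stats)
        {
            if (now - stats.LastSeen < TimeSpan.FromSeconds(1)) stats.FastStreak++;
            else stats.FastStreak = 0;
            stats.LastSeen = now;
            return stats.FastStreak >= 5;
        }
    }
}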
回答 by Dave Sumter
You said matching the user agent on ‘bot' may be awkward, but we've found it to be a pretty good match. Our studies have shown that it will cover about 98% of the hits you receive. We also haven't come across any false positive matches yet either. If you want to raise this up to 99.9% you can include a few other well-known matches such as ‘crawler', ‘baiduspider', ‘ia_archiver', ‘curl' etc. We've tested this on our production systems over millions of hits.
你说用 'bot' 去匹配用户代理可能显得笨拙,但我们发现这个匹配效果相当好。我们的研究表明,它能覆盖你收到的大约 98% 的请求,而且到目前为止我们还没有遇到任何误报。如果你想把覆盖率提高到 99.9%,可以再加入其他几个常见的匹配项,例如 'crawler'、'baiduspider'、'ia_archiver'、'curl' 等。我们已经在生产系统上用数百万次请求测试过这一点。
Here are a few C# solutions for you:
以下是几个适合您的 C# 解决方案:
1) Simplest
1) 最简单
Is the fastest when processing a miss. i.e. traffic from a non-bot – a normal user. Catches 99+% of crawlers.
处理未命中时最快。即来自非机器人的流量——普通用户。捕获 99+% 的爬虫。
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
2) Medium
2) 中等
Is the fastest when processing a hit. i.e. traffic from a bot. Pretty fast for misses too. Catches close to 100% of crawlers. Matches ‘bot', ‘crawler', ‘spider' upfront. You can add to it any other known crawlers.
处理命中(即来自机器人的流量)时速度最快,处理未命中时也相当快。能捕获接近 100% 的爬虫。预先匹配“bot”、“crawler”、“spider”,你还可以在其中加入任何其他已知的爬虫。
List<string> Crawlers3 = new List<string>()
{
"bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
"lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
"atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
"cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
"esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
"htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
"image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
"lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
"motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
"netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
"patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
"raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
"searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
"curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
"urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
"webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
"webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
"wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
3) Paranoid
3) 偏执型
Is pretty fast, but a little slower than options 1 and 2. It's the most accurate, and allows you to maintain the lists if you want. You can maintain a separate list of names with ‘bot' in them if you are afraid of false positives in future. If we get a short match we log it and check it for a false positive.
速度也相当快,只是比选项 1 和 2 稍慢一点。它是最准确的,而且你可以按需维护这些列表。如果担心将来出现误报,可以把名字中带有“bot”的爬虫单独维护在一个列表里。如果匹配到的字符串很短,我们会把它记录下来并检查是否是误报。
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
"googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
"yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
"botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
"ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
"dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
"irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
"simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
"vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
"spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
"baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
"nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
"bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
"cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
"fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
"gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
"havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
"jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
"larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
"merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
"muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
"objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
"phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
"roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
"senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
"spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
"titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
"webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
"webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
"robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
"legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Notes:
注意事项:
- It's tempting to just keep adding names to the regex option 1. But if you do this it will become slower. If you want a more complete list then linq with lambda is faster.
- Make sure .ToLower() is outside of your linq method – remember the method is a loop and you would be modifying the string during each iteration.
- Always put the heaviest bots at the start of the list, so they match sooner.
- Put the lists into a static class so that they are not rebuilt on every pageview.
- 很容易忍不住不断往选项 1 的正则表达式里添加名称,但这样做它会变得越来越慢。如果想要更完整的列表,使用 lambda 的 LINQ 会更快。
- 确保 .ToLower() 在你的 linq 方法之外——记住该方法是一个循环,你将在每次迭代期间修改字符串。
- 始终将最重的机器人放在列表的开头,以便它们更快地匹配。
- 将列表放入一个静态类中,这样它们就不会在每次页面浏览时重新构建。
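A sketch of that last point: wrapping the option 1 regex (the lists from options 2 and 3 could be handled the same way) in a static class so it is built once per application rather than on every page view; the class name is just illustrative:

using System.Text.RegularExpressions;

public static class CrawlerDetector
{
    // Compiled once for the lifetime of the app domain, not rebuilt per page view.
    private static readonly Regex CrawlerPattern = new Regex(
        @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool IsCrawler(string userAgent)
    {
        return !string.IsNullOrEmpty(userAgent) && CrawlerPattern.IsMatch(userAgent);
    }
}

// Usage from a page or handler (sketch):
// bool iscrawler = CrawlerDetector.IsCrawler(Request.UserAgent);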
Honeypots
蜜罐
The only real alternative to this is to create a ‘honeypot' link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
唯一真正的替代方案,是在你的网站上创建一个只有机器人才会访问到的“蜜罐”链接,然后把访问蜜罐页面的用户代理字符串记录到数据库中。之后就可以用这些记录下来的字符串来对爬虫进行分类。
Positives:
It will match some unknown crawlers that aren't declaring themselves.
优点:
它能匹配到一些没有表明自己身份的未知爬虫。
Negatives:
Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
缺点:
并非所有爬虫都会抓取得足够深入、访问到你网站上的每一个链接,因此它们可能根本不会碰到你的蜜罐。