Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/422969/
How to recognize bots with php?
Asked by Hugo Gameiro
I am building stats for my users and don't wish the visits from bots to be counted.
Now I have a basic PHP page with MySQL that increases a counter by 1 each time the page is called.
But bots are also added to the count.
Can anyone think of a way?
Mainly it's just the major ones that mess things up: Google, Yahoo, MSN, etc.
Accepted answer by ine
You should filter by user-agent strings. You can find a list of about 300 common user-agents given by bots here: http://www.robotstxt.org/db.html. Running through that list and ignoring bot user-agents before you run your SQL statement should solve your problem for all practical purposes.
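For illustration, a minimal sketch of that filtering, assuming a hypothetical page_views table and a hand-picked sample of bot user-agent fragments (the real robotstxt.org list is far longer):

```php
<?php
// Tiny hand-picked sample; in practice, build this list from the
// robotstxt.org database. Table, column, and credential names are made up.
$botFragments = array('googlebot', 'slurp', 'msnbot', 'bingbot', 'baiduspider');

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';

$isBot = ($userAgent === ''); // treat a missing/empty UA as a bot
foreach ($botFragments as $fragment) {
    if (strpos($userAgent, $fragment) !== false) {
        $isBot = true;
        break;
    }
}

if (!$isBot) {
    // Only a presumed-human visit reaches the SQL statement.
    $mysqli = new mysqli('localhost', 'user', 'pass', 'stats');
    $page   = 'index';
    $stmt   = $mysqli->prepare('UPDATE page_views SET hits = hits + 1 WHERE page = ?');
    $stmt->bind_param('s', $page);
    $stmt->execute();
}
```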
If you don't want the search engines to even reach the page, use a basic robots.txt file to block them.
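A minimal robots.txt for that might look like the following, assuming the counted page lives at a hypothetical /counted-page.php. Note that well-behaved crawlers such as Googlebot honor it, while misbehaving bots ignore it:

```
# Ask all crawlers to skip the counted page entirely.
User-agent: *
Disallow: /counted-page.php
```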
Answer by Rob
You can check the User-Agent string; empty strings, or strings containing 'robot', 'spider', 'crawler', or 'curl', are likely to be robots.
```php
preg_match('/robot|spider|crawler|curl|^$/i', $_SERVER['HTTP_USER_AGENT']);
```
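In context, that check might guard the counter like this (a sketch; the increment itself is left to your existing code):

```php
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!preg_match('/robot|spider|crawler|curl|^$/i', $ua)) {
    // No bot keyword matched and the UA is non-empty: count this visit.
}
```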
Answer by ConroyP
We have a similar use-case to yours, and one option we've recently found quite helpful is the UASParser class from user-agent-string.info.
It's a PHP class which pulls the latest set of user agent string definitions and caches them locally. The class can be configured to pull the definitions as often or as rarely as you deem fit. Automatically fetching them like this means that you don't have to keep on top of the various changes to bot user agents or new ones coming on the market, although you are relying on UAS.info to do this accurately.
When the class is called, it parses the current visitor's user agent and returns an associative array breaking out the constituent parts, e.g.
```
Array
(
    [typ] => browser
    [ua_family] => Firefox
    [ua_name] => Firefox 3.0.8
    [ua_url] => http://www.mozilla.org/products/firefox/
    [ua_company] => Mozilla Foundation
    ........
    [os_company] => Microsoft Corporation.
    [os_company_url] => http://www.microsoft.com/
    [os_icon] => windowsxp.png
)
```
The field typ is set to browser when the UA is identified as likely belonging to a human visitor, in which case you can update your stats.
Couple of caveats here:
- You're relying on UAS.info for the user agent strings provided to be accurate and up-to-date
- Bots like Google and Yahoo declare themselves in their user agent strings, but this method will still count visits from bots pretending to be human visitors (sending spoofed UAs)
- As @amdfan mentioned above, blocking bots via robots.txt should stop most of them from reaching your page. If you need the content to be indexed but not increment stats, then the robots.txt method won't be a realistic option
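For illustration, a sketch of driving the counter off the typ field. The class and method names here (UASparser, SetCacheDir(), Parse()) follow the library's documentation as I recall it; treat them as assumptions and verify against the current API:

```php
require_once 'UASparser.php'; // install path is an assumption

$parser = new UASparser();                // class name per UAS.info docs (assumption)
$parser->SetCacheDir(sys_get_temp_dir()); // where fetched definitions get cached (assumption)
$ua = $parser->Parse();                   // returns an associative array like the one shown above

if (isset($ua['typ']) && $ua['typ'] === 'browser') {
    // Likely a human visitor: update the stats here.
}
```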
Answer by Rob
Check the user agent before incrementing the page view count, but remember that this can be spoofed. PHP exposes the user agent in $_SERVER['HTTP_USER_AGENT'], assuming that the web server provides it with this information. More information about $_SERVER can be found at http://www.php.net/manual/en/reserved.variables.server.php.
You can find a list of user agents at http://www.user-agents.org; Googling will also provide the names of those belonging to the major providers. A third possible source would be your web server's access logs, if you can aggregate them.
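A sketch of that lookup, assuming you've saved such a list to a hypothetical local file bots.txt with one user-agent string per line:

```php
// File name and one-entry-per-line format are assumptions for illustration.
$botList = array_map('strtolower', file('bots.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';

// Exact match against the list; real bot UAs often embed version numbers,
// so a substring test (strpos) may be more forgiving in practice.
if ($ua !== '' && !in_array($ua, $botList, true)) {
    // Count the visit here.
}
```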
Answer by Simon B. Jensen
Have you tried identifying them by their user-agent information? A simple google search should give you the user-agents used by Google etc.
This, of course, is not foolproof, but most crawlers by major companies supply a distinct user-agent.
EDIT: This assumes you do not want to restrict the bots' access, but just don't want to count their visits in your statistics.
Answer by Irshad Khan
A 100% working bot detector. It is working on my website to detect robots, crawlers, spiders, and copiers.
```php
function isBotDetected() {
    if (preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])) {
        return true; // one of the bots listed above was detected
    }
    return false;
} // End :: isBotDetected()
```
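Used before the counter, for example:

```php
// Only count the hit when the UA did not match any of the patterns above.
if (!isBotDetected()) {
    // increment the page view counter here
}
```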
