PHP - How to detect search engine bots with PHP?
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/677419/
How to detect search engine bots with PHP?
Asked by terrific
How can one detect search engine bots using PHP?
Accepted answer by Ólafur Waage
Here's a Search Engine Directory of Spider names
Then you use $_SERVER['HTTP_USER_AGENT'] to check whether the agent is one of those spiders.
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
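A minor variant of the same idea: stripos() does the case-insensitive search without the strtolower() call:

if (isset($_SERVER['HTTP_USER_AGENT'])
    && stripos($_SERVER['HTTP_USER_AGENT'], 'googlebot') !== false)
{
    // what to do
}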
Answered by minnur
I use the following code, which seems to be working fine:
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
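A minimal call site might look like this (the error_log() action is just an illustration, not part of the original answer):

if (_bot_detected()) {
    // e.g. skip analytics scripts, or log the visit
    error_log('Bot visit: ' . $_SERVER['HTTP_USER_AGENT']);
}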
Update 16-06-2017 (https://support.google.com/webmasters/answer/1061943?hl=en): added mediapartners.
Answered by Jukka Dahlbom
Check $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or, more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to, say, log the number of visits by the most common search engine crawlers, you could use:
$interestingCrawlers = array( 'google', 'yahoo' );
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // 'i' makes the match case-insensitive
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // Found a match
{
    // $matches[1] contains the crawler name that matched, either 'google' or 'yahoo'
}
Answered by macherif
You can check whether the visitor is a search engine with this function:
<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // Build an alternation of all crawler tokens and search for any of them
    // inside the user agent string (case-insensitive). It is better to cache
    // this string than to implode() it on every call.
    $crawlers_agents = implode('|', array_map('preg_quote', $crawlers));

    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>
Then you can use it like this:
<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need to lang redirection";
?>
Answered by mgutt
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}
In addition I use a whitelist of allowed bots, so unwanted bots can be blocked:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}
An unwanted bot (i.e., a potential false-positive user) can then solve a captcha to unblock itself for 24 hours. Since no one ever solves this captcha, I know it doesn't produce false positives, so the bot detection seems to work perfectly.
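Put together, the gatekeeping logic is roughly the following sketch (patterns abbreviated from the two lists above; the captcha step is only stubbed):

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$looksLikeBot = (bool) preg_match('/bot|crawl|curl|spider|search/i', $ua);
$isAllowedBot = (bool) preg_match('/googlebot|bingbot|slurp|yandex/i', $ua);

if ($looksLikeBot && !$isAllowedBot) {
    // Unwanted bot (or false-positive user): serve the captcha that
    // unblocks the client for 24 hours.
}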
Note: my whitelist is based on Facebook's robots.txt.
Answered by Fabian Kessler
Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.
The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).
First, perform a reverse DNS lookup of the client IP. For Google this yields a host name under googlebot.com; for Bing it's under search.msn.com. Then, because someone could set such a reverse DNS record on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.
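A minimal PHP sketch of that two-step check for Googlebot (the function name is my own; it assumes working DNS and handles IPv4 only, since gethostbynamel() returns IPv4 addresses):

function isVerifiedGooglebot($ip)
{
    // Step 1: reverse DNS lookup of the visitor's IP.
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Step 2: the forward lookup of that hostname must resolve back to the same IP.
    $ips = gethostbynamel($host);

    return $ips !== false && in_array($ip, $ips, true);
}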
I've written a library in Java that performs these checks for you; feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
Answered by WonderLand
I use this function; part of the regex comes from PrestaShop, but I added some more bots to it.
public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

    return $isBot;
}
Anyway, take care: some bots use a browser-like user agent to fake their identity.
(I see many Russian IPs with this behaviour on my site.)
One distinctive feature of most bots is that they don't carry any cookie, so no session is attached to them.
(I am not sure how best to exploit this, but it is certainly a good way to track them.)
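A rough sketch of that cookie heuristic (my own illustration, not from the answer; genuine first-time visitors also arrive without a cookie, so a second request is needed to tell them apart):

session_start();
// The session cookie is only present from the second request onwards.
$hasSessionCookie = isset($_COOKIE[session_name()]);
if (!$hasSessionCookie) {
    // Possibly a bot -- or a first visit; session_start() has now set the
    // cookie, so a real browser will carry it on the next request.
}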
Answered by Gumbo
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client's IP address ($_SERVER['REMOTE_ADDR']) with a list of IP addresses of search engine bots.
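For the IP-list approach, a minimal sketch (the addresses below are placeholders, not real or current bot IPs; such lists change and have to be maintained):

$botIps = array('66.249.66.1', '157.55.39.1'); // hypothetical examples
if (in_array($_SERVER['REMOTE_ADDR'], $botIps, true)) {
    // treat as a search engine bot
}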
Answered by mattab
Use the Device Detector open-source library; it offers an isBot() function: https://github.com/piwik/device-detector
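A minimal usage sketch, assuming the library is installed via Composer (the package is now published as matomo/device-detector; the repository above redirects there):

require 'vendor/autoload.php';

use DeviceDetector\DeviceDetector;

$dd = new DeviceDetector($_SERVER['HTTP_USER_AGENT']);
$dd->parse();

if ($dd->isBot()) {
    // Details such as the bot's name and category are available via getBot().
    $botInfo = $dd->getBot();
}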
Answered by L. Cosio
<?php // IPCLOACK HOOK
// CLOAKING_LEVEL and FILE_BOTS are constants defined by the YACG script.
if (CLOAKING_LEVEL != 4) {
    // Refresh the local spider list once per day.
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
            'http://labs.getyacg.com/spiders/google.txt',
            'http://labs.getyacg.com/spiders/inktomi.txt',
            'http://labs.getyacg.com/spiders/lycos.txt',
            'http://labs.getyacg.com/spiders/msn.txt',
            'http://labs.getyacg.com/spiders/altavista.txt',
            'http://labs.getyacg.com/spiders/askjeeves.txt',
            'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list); // fetch() is YACG's own HTTP helper
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp = fopen(FILE_BOTS, "w");
        fwrite($fp, $opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0] . '.' . $exp[1] . '.' . $exp[2] . '.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // Score the request: known spider hostname, known IP class, known agent.
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    // Real spiders don't send a referer.
    if (strlen($ref) > 0) {
        $cloak = 0;
    }
    $cloakdirective = ($cloak >= $threshold) ? 1 : 0;
}
?>
That would be the ideal way to cloak for spiders. It's from an open-source script called YACG: http://getyacg.com
It needs a bit of work, but it's definitely the way to go.

