PHP: how to identify a web crawler?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/8404775/

How to identify web-crawler?

php, web-crawler

Asked by clarkk

How can I filter out hits from web crawlers etc., i.e. hits that are not from humans?

I use maxmind.com to look up the city for an IP address. It is not exactly cheap if I have to pay for ALL hits, including web crawlers, robots, etc.

Answered by Kiril

There are two general ways to detect robots and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder.

Polite

These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often you are crawled. Politeness is ensured through a robots.txt file, in which you specify which bots, if any, should be allowed to crawl your website and how often your website can be crawled. This assumes that the robot you're dealing with is polite.

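As a rough illustration, a minimal robots.txt along these lines tells polite crawlers what they may fetch and how fast (the paths and delay value here are made up, and Crawl-delay is only honored by some crawlers, such as Bing and Yandex, not Google):

# Let Google's crawler fetch everything
User-agent: Googlebot
Disallow:

# Ask every other crawler to stay out of /private/ and to slow down
User-agent: *
Disallow: /private/
Crawl-delay: 10
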
Aggressive

Another way to keep bots off your site is to get aggressive.

User Agent

Some aggressive behavior includes (as previously mentioned by other users) the filtering of user-agent strings. This is probably the simplest, but also the least reliable, way to detect whether it's a user or not. A lot of bots tend to spoof user agents and some do it for legitimate reasons (i.e. they only want to crawl mobile content), while others simply don't want to be identified as bots. Even worse, some bots spoof legitimate/polite bot agents, such as the user agents of Google, Microsoft, Lycos and other crawlers which are generally considered polite. Relying on the user agent can be helpful, but not by itself.

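One way to deal with that spoofing (not mentioned in this answer, so treat it as an aside) is a forward-confirmed reverse DNS check: look up the hostname for the IP, make sure it belongs to the crawler's domain, and confirm that the hostname resolves back to the same IP. A minimal PHP sketch for a "Googlebot" user agent:

// Sketch: verify a claimed Googlebot via forward-confirmed reverse DNS,
// since the user-agent string by itself can be spoofed.
function looksLikeRealGooglebot($ip)
{
    $host = gethostbyaddr($ip);                       // reverse lookup, e.g. "crawl-66-249-66-1.googlebot.com"
    if ($host === false || $host === $ip) {
        return false;                                 // no usable PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                 // hostname is not in Google's domains
    }
    return gethostbyname($host) === $ip;              // forward lookup must point back to the same IP
}
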
There are more aggressive ways to deal with robots that spoof user agents AND don't abide by your robots.txt file:

Bot Trap

I like to think of this as a "Venus Fly Trap," and it basically punishes any bot that wants to play tricks with you.

A bot trap is probably the most effective way to find bots that don't adhere to your robots.txt file without actually impairing the usability of your website. Creating a bot trap ensures that only bots are captured and not real users. The basic way to do it is to set up a directory which you specifically mark as off limits in your robots.txt file, so any robot that is polite will not fall into the trap. The second thing you do is to place a "hidden" link from your website to the bot trap directory (this ensures that real users will never go there, since real users never click on invisible links). Finally, you ban any IP address that goes to the bot trap directory.

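A minimal sketch of such a trap page in PHP could look like the following (the paths, file names and the 403 response are my own assumptions, not part of the answer): the page simply records the visiting IP in a blacklist file, while robots.txt contains "Disallow: /bot-trap/" and the link pointing at the page is invisible to real users.

// /bot-trap/index.php -- hypothetical trap page.
// Polite bots never get here because robots.txt disallows /bot-trap/,
// and real users never get here because the link to it is hidden.

$blacklistFile = __DIR__ . '/../blacklist.txt';   // assumed location of the blacklist
$ip = $_SERVER['REMOTE_ADDR'];

// Read the current blacklist (one IP per line) and append this IP if it is new.
$known = file_exists($blacklistFile)
    ? file($blacklistFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : array();

if (!in_array($ip, $known, true)) {
    file_put_contents($blacklistFile, $ip . PHP_EOL, FILE_APPEND | LOCK_EX);
}

http_response_code(403);   // anything that lands here gets refused
exit('Forbidden');
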
Here are some instructions on how to achieve this: Create a bot trap (or in your case: a PHP bot trap).

Note: of course, some bots are smart enough to read your robots.txt file, see all the directories which you've marked as "off limits" and STILL ignore your politeness settings (such as crawl rate and allowed bots). Those bots will probably not fall into your bot trap despite the fact that they are not polite.

Violent

I think this is actually too aggressive for the general audience (and general use), so if there are any kids under the age of 18, then please take them to another room!

You can make the bot trap "violent" by simply not specifying a robots.txt file. In this situation ANY BOT that crawls the hidden links will probably end up in the bot trap and you can ban all bots, period!

The reason this is not recommended is that you may actually want some bots to crawl your website (such as Google, Microsoft or other bots for site indexing). Allowing your website to be politely crawled by the bots from Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it shows up when people search for it on their favorite search engine.

Self Destructive

Yet another way to limit what bots can crawl on your website is to serve CAPTCHAs or other challenges which a bot cannot solve. This comes at the expense of your users, and I would think that anything which makes your website less usable (such as a CAPTCHA) is "self destructive." This, of course, will not actually block the bot from repeatedly trying to crawl your website; it will simply make your website very uninteresting to them. There are ways to "get around" the CAPTCHAs, but they're difficult to implement so I'm not going to delve into this too much.

Conclusion

For your purposes, probably the best way to deal with bots is to employ a combination of the above mentioned strategies:

  1. Filter user agents.
  2. Set up a bot trap (the violent one).

Catch all the bots that go into the violent bot trap and simply black-list their IPs (but don't block them). This way you will still get the "benefits" of being crawled by bots, but you will not have to pay to check the IP addresses that are black-listed due to going to your bot trap.

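A rough sketch of that gatekeeping in PHP might look like this, where crawlerDetect() is the user-agent filter from the next answer, blacklist.txt is the hypothetical file written by the bot trap above, and lookupCityWithMaxmind() is just a placeholder for your actual MaxMind call:

$ip        = $_SERVER['REMOTE_ADDR'];
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// IPs previously caught by the bot trap (one per line).
$blacklist = file_exists('blacklist.txt')
    ? file('blacklist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : array();

$isBot = crawlerDetect($userAgent) !== false   // known crawler user agent
      || in_array($ip, $blacklist, true);      // or an IP that fell into the trap

if (!$isBot) {
    // Only pay for the geo-IP lookup when the hit looks human.
    $city = lookupCityWithMaxmind($ip);        // placeholder for the real MaxMind client call
}
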
Answered by Sudhir Bastakoti

You can check USER_AGENT, something like:

// Returns the name of the matched crawler if the user agent contains a known
// bot signature, or false if it looks like a regular browser.
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
    array('Google', 'Google'),
    array('msnbot', 'MSN'),
    array('Rambler', 'Rambler'),
    array('Yahoo', 'Yahoo'),
    array('AbachoBOT', 'AbachoBOT'),
    array('accoona', 'Accoona'),
    array('AcoiRobot', 'AcoiRobot'),
    array('ASPSeek', 'ASPSeek'),
    array('CrocCrawler', 'CrocCrawler'),
    array('Dumbot', 'Dumbot'),
    array('FAST-WebCrawler', 'FAST-WebCrawler'),
    array('GeonaBot', 'GeonaBot'),
    array('Gigabot', 'Gigabot'),
    array('Lycos', 'Lycos spider'),
    array('MSRBOT', 'MSRBOT'),
    array('Scooter', 'Altavista robot'),
    array('AltaVista', 'Altavista robot'),
    array('IDBot', 'ID-Search Bot'),
    array('eStyle', 'eStyle Bot'),
    array('Scrubby', 'Scrubby robot')
    );

    foreach ($crawlers as $c)
    {
        if (stristr($USER_AGENT, $c[0]))
        {
            return($c[1]);
        }
    }

    return false;
}

// example

$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);

Answered by Carl Zulauf

The user agent ($_SERVER['HTTP_USER_AGENT']) often identifies whether the connecting agent is a browser or a robot. Review logs/analytics for the user agents of crawlers that visit your site. Filter accordingly.

Take note that the user agent is a header supplied by the client application. As such it can be pretty much anything and shouldn't be trusted 100%. Plan accordingly.

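For instance, a defensive way to read the header (an assumption on my part, not something prescribed by this answer) is to treat a missing or empty value as suspicious rather than defaulting to "browser":

// HTTP_USER_AGENT is client-supplied and may be absent, empty or forged.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';

// An empty user agent almost never comes from a normal browser.
$suspicious = ($userAgent === '');
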
Answered by Kris Craig

Checking the User-Agent will protect you from legitimate bots like Google and Yahoo.

However, if you're also being hit with spam bots, then chances are User-Agent comparison won't protect you, since those bots typically forge a common User-Agent string anyway. In that instance, you would need to employ more sophisticated measures. If user input is required, a simple image verification scheme like ReCaptcha or phpMeow will work.

If you're looking to filter out all page hits from a bot, unfortunately, there's no 100% reliable way to do this if the bot is forging its credentials. This is just an annoying fact of life on the internet that web admins have to put up with.

Answered by Arda

I found this package, it's actively being developed and I'm quite liking it so far:

https://github.com/JayBizzle/Crawler-Detect

It's as simple as this:

// Typically installed via Composer: composer require jaybizzle/crawler-detect
use Jaybizzle\CrawlerDetect\CrawlerDetect;

$CrawlerDetect = new CrawlerDetect;

// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
    // true if crawler user agent detected
}

// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
    // true if crawler user agent detected
}

// Output the name of the bot that matched (if any)
echo $CrawlerDetect->getMatches();

Answered by rubo77

useragentstring.com serves a list that you can use to analyze the user-agent string:

$api_request = "http://www.useragentstring.com/?uas=" . urlencode($_SERVER['HTTP_USER_AGENT']) . "&getJSON=all";
$ua = json_decode(file_get_contents($api_request), true); // decode into an associative array
if ($ua !== null && $ua["agent_type"] == "Crawler") die();