PHP: How to detect fake users (crawlers) and cURL

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12257584/


How to detect fake users ( crawlers ) and cURL

Tags: php, curl, spam-prevention

Asked by Ken Le

Some other websites use cURL and a fake HTTP referer to copy my website's content. Is there any way to detect cURL, or to tell that a request does not come from a real web browser?

Answered by Alain Tiemblo

There is no magic solution to avoid automatic crawling. Everything a human can do, a robot can do too. There are only solutions that make the job harder, so hard that only highly skilled geeks will try to get past them.

I was in the same trouble some years ago, and my first advice is: if you have time, be a crawler yourself (I assume a "crawler" is the person who crawls your website); this is the best school for the subject. By crawling several websites, I learned about different kinds of protections, and by combining them I became effective.

Here are some examples of protections you may try.



Sessions per IP


If a user opens 50 new sessions each minute, you can assume this user could be a crawler that does not handle cookies. Of course, curl manages cookies perfectly, but if you couple this with a per-session visit counter (explained later), or if your crawler is sloppy with cookies, it may be effective.

It is difficult to imagine that 50 people behind the same shared connection will hit your website simultaneously (it of course depends on your traffic; that is up to you). If this happens, you can lock pages of your website until a captcha is solved.

Idea:

1) Create two tables: one to store banned IPs and one to store IP/session pairs.

create table if not exists sessions_per_ip (
  ip int unsigned,
  session_id varchar(32),
  creation timestamp default current_timestamp,
  primary key(ip, session_id)
);

create table if not exists banned_ips (
  ip int unsigned,
  creation timestamp default current_timestamp,
  primary key(ip)
);

2) At the beginning of your script, delete entries that are too old from both tables.

3) Next, check whether your user's IP is banned (if so, set a flag to true).

4) If not, count how many sessions exist for his IP.

5) If he has too many sessions, insert his IP into the banned table and set the flag.

6) Insert his IP and session into the sessions-per-IP table if they have not already been inserted.

I wrote a code sample to illustrate the idea more concretely.

<?php

try
{

    // Some configuration (small values for demo)
    $max_sessions = 5; // 5 sessions/ip simultaneously allowed
    $check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
    $lock_duration = 60; // time to lock your website for this ip if max_sessions is reached

    // Mysql connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries in tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);

    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0)
    {
        $banned = true;
    }
    else
    {
        // Count entries in our db for this ip
        $query = "select count(*)  from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions)
        {
            // Lock website for this ip
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry on our db if user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point, $banned tells you whether the user is banned or not.
    // The following code will allow us to test it...

    // We do not display anything now because we'll play with sessions :
    // to make the demo more readable I prefer going step by step like
    // this.
    ob_start();

    // Displays your current sessions
    echo "Your current sessions keys are : <br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new']))
    {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display if you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned)
    {
        echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    }
    else
    {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
}
catch (PDOException $e)
{
    /*echo*/ $e->getMessage();
}

?>


Visit Counter


If your user uses the same cookie while crawling your pages, you will be able to use his session to block him. The idea is quite simple: is it plausible that your user visits 60 pages in 60 seconds?

Idea:

  1. Create an array in the user session; it will contain visit timestamps (time()).
  2. Remove visits older than X seconds from this array.
  3. Add a new entry for the current visit.
  4. Count the entries in this array.
  5. Ban your user if he visited Y pages.

Sample code:

<?php

$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
    $_SESSION['visit_counter'] = array();
}

// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// we add the current visit into our array
$_SESSION['visit_counter'][] = time();

// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>


An image to download


When a crawler does its dirty work, it is after a large amount of data in the shortest possible time. That's why crawlers don't download the images on pages; it costs too much bandwidth and makes the crawl slower.

This idea (in my opinion the most elegant and the easiest to implement) uses mod_rewrite to hide code behind a .jpg/.png/… image URL. This image should be present on each page you want to protect: it could be your website logo, but you should choose a small image (because this image must not be cached).

Idea:

1/ Add these lines to your .htaccess:

RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php

2/ Create your logo.php with the security check:

<?php

// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image
header("Content-type: image/jpg");
readfile("logo.jpg");
die();

3/ Increment no_logo_count on each page you want to protect, and check whether it has reached your limit.

Sample code:

<?php

$no_logo_limit = 5; // number of allowed pages without loading the logo

// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
    $_SESSION['no_logo_count'] = 0;
}
else
{
    $_SESSION['no_logo_count']++;
}

// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not loaded image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display "show image" link : note that we're using .jpg file
echo <<< EOT

<div id="image_container">
    <a id="image_load" href="#">Load image</a>
</div>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course use <img src="logo.jpg" />
  $('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
  });

</script>

EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>


Cookie check


You can create cookies on the JavaScript side to check whether your users actually interpret JavaScript (a crawler using curl does not, for example).

The idea is quite simple: it is about the same as the image check.

  1. Set a $_SESSION value to 1 and increment it on each visit.
  2. If a cookie (set in JavaScript) exists, set the session value back to 0.
  3. If this value reaches a limit, ban your user.

Code:

<?php

$no_cookie_limit = 5; // number of allowed pages without the cookie being set

// Start session and reset counter
session_start();

if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
    $_SESSION['cookie_check_count'] = 0;
}

// Initializes cookie (note: rename it to a more discrete name of course) or check cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
}
else
{
    // Cookie is properly set so we reset counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<br/>
<a id="reload" href="#">Reload</a>
<br/>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

// Display "set cookie" link
echo <<< EOT

<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie="cookie_check=42;expires=" + expires.toGMTString();
  });

</script>
EOT;


// Display "unset cookie" link
echo <<< EOT

<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
  });

</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}


Protection against proxies


A few words about the different kinds of proxies we may find on the web:

  • A “normal” proxy exposes information about the user's connection (notably, his IP).
  • An anonymous proxy does not expose the IP, but advertises the proxy usage in its headers.
  • A high-anonymity proxy does not expose the user's IP and does not send any header a regular browser would not send.

It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymity proxies.

Some $_SERVER variables may contain specific keys if your user is behind a proxy (exhaustive list taken from this question):

  • CLIENT_IP
  • FORWARDED
  • FORWARDED_FOR
  • FORWARDED_FOR_IP
  • HTTP_CLIENT_IP
  • HTTP_FORWARDED
  • HTTP_FORWARDED_FOR
  • HTTP_FORWARDED_FOR_IP
  • HTTP_PC_REMOTE_ADDR
  • HTTP_PROXY_CONNECTION
  • HTTP_VIA
  • HTTP_X_FORWARDED
  • HTTP_X_FORWARDED_FOR
  • HTTP_X_FORWARDED_FOR_IP
  • HTTP_X_IMFORWARDS
  • HTTP_XROXY_CONNECTION
  • VIA
  • X_FORWARDED
  • X_FORWARDED_FOR

You may apply different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of those keys in your $_SERVER variable.
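
As an illustration, a minimal sketch of such a check might look like this (my own example, not part of the original answer; the helper name and the limit values are hypothetical):

<?php

// Hypothetical helper: returns true if any of the proxy-related keys listed
// above is present in $_SERVER.
function seems_behind_proxy(array $server)
{
    $proxy_keys = array(
        'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
        'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
        'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR', 'HTTP_PROXY_CONNECTION',
        'HTTP_VIA', 'HTTP_X_FORWARDED', 'HTTP_X_FORWARDED_FOR',
        'HTTP_X_FORWARDED_FOR_IP', 'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION',
        'VIA', 'X_FORWARDED', 'X_FORWARDED_FOR',
    );

    foreach ($proxy_keys as $key) {
        if (array_key_exists($key, $server)) {
            return true;
        }
    }
    return false;
}

// Example: lower the allowed session limit when a proxy is suspected
$max_sessions = seems_behind_proxy($_SERVER) ? 2 : 5;

?>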



Conclusion


There are many ways to detect abuse on your website, so you will certainly find a solution. But you need to know precisely how your website is used, so that your protections are not aggressive toward your "normal" users.

Answered by raina77ow

Remember: HTTP is not magic. There is a defined set of headers sent with each HTTP request; if these headers can be sent by a web browser, they can just as well be sent by any program, including cURL (and libcurl).

Some consider it a curse, but on the other hand it is a blessing, as it greatly simplifies functional testing of web applications.
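
As an illustration of how easily those headers can be faked, here is a minimal sketch (my own example, not from the original answer; the target URL and header values are made up) of a libcurl request masquerading as a browser:

<?php

// A cURL request that sends the same headers a browser would,
// so header-based detection alone proves nothing.
$ch = curl_init('https://example.com/some-page'); // hypothetical target URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'); // fake browser user agent
curl_setopt($ch, CURLOPT_REFERER, 'https://example.com/');                        // fake referer
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: en-US,en;q=0.9',
));
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // cookies are handled too
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$html = curl_exec($ch);
curl_close($ch);

?>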

UPDATE: As unr3al011 rightly noticed, curl doesn't execute JavaScript, so in theory it is possible to create a page that will behave differently when viewed by grabbers (for example, by setting and later checking a specific cookie via JS).

Still, it would be a very fragile defense. The page's data still has to be fetched from the server, and that HTTP request (it is always an HTTP request) can be emulated by curl. Check this answer for an example of how to defeat such a defense.

... and I didn't even mention that some grabbers are able to execute JavaScript.

Answered by Maks3w

The way to avoid fake referers is to track the user.

You can track the user with one or more of these methods:

  1. Save a cookie in the browser client with some special code (e.g. the last URL visited, a timestamp) and verify it on each request to your server.

  2. Same as before, but using sessions instead of explicit cookies.

For cookies, you should add cryptographic protection, for example:

[Cookie]
url => http://someurl/
hash => dsafdshfdslajfd

The hash is calculated in PHP this way:

$url = $_COOKIE['url'];
$hash = $_COOKIE['hash'];
$secret = 'This is a fixed secret in the code of your application';

// 'algo' is a placeholder: use a real hashing algorithm name such as 'sha256'
$isValidCookie = (hash('algo', $secret . $url) === $hash);

$isValidReferer = $isValidCookie && ($_SERVER['HTTP_REFERER'] === $url);
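
A possible counterpart, not shown in the original answer, is setting the cookie when the page is served so it can be verified on the next request; in this sketch 'sha256' is only an example algorithm and the URL is a placeholder:

<?php

$secret = 'This is a fixed secret in the code of your application';
$url    = 'http://someurl/';              // the URL being served (example value)
$hash   = hash('sha256', $secret . $url); // 'sha256' chosen here as an example algorithm

// Must be called before any output is sent
setcookie('url', $url);
setcookie('hash', $hash);

?>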

Answered by Fusca Software

You can detect cURL by its user agent with the following method. But be warned: the user agent can be overridden by the user. The default setting, however, can be recognized like this:

function is_curl() {
    // The default cURL user agent contains the string "curl" (but it is trivial to override)
    return stristr($_SERVER["HTTP_USER_AGENT"], 'curl') !== false;
}

Answered by Rayvyn

As some have mentioned, cURL cannot execute JavaScript (to my knowledge), so you could try setting something up as raina77ow suggests, but that would not work for other grabbers/downloaders.

I suggest you try building a bot trap; that way you can deal with the grabbers/downloaders that can execute JavaScript.

I don't know of any single solution that fully prevents this, so my best recommendation is to try multiple solutions:

1) Only allow known user agents, such as all mainstream browsers, in your .htaccess file.

2) Set up your robots.txt to keep bots out.

3) Set up a bot trap for bots that do not respect the robots.txt file (a minimal sketch follows below).
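
A bot trap can be as simple as a hidden link that humans never see and polite bots never follow. The sketch below is my own illustration, not part of the original answer; it assumes the banned_ips table and config.php from the first answer:

<?php

// trap.php -- hypothetical bot-trap page.
// Link to it from your pages with an invisible anchor, e.g.
//   <a href="/trap.php" style="display:none" rel="nofollow"></a>
// and disallow it in robots.txt:
//   User-agent: *
//   Disallow: /trap.php
// Humans never see the link and well-behaved bots respect robots.txt,
// so anything requesting this page gets banned.

require_once("config.php"); // assumed to define $host, $base, $user, $password

$dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Reuses the banned_ips table created in the "Sessions per IP" section
$stmt = $dbh->prepare("insert ignore into banned_ips ( ip ) values ( ? )");
$stmt->execute(array(ip2long($_SERVER['REMOTE_ADDR'])));

header("HTTP/1.1 403 Forbidden");
die();

?>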

Answered by Marcel Gent Simonis

Put this into your root folder as an .htaccess file; it may help. I found it on a webhosting provider's site, but I don't know exactly what it means :)

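# Each SetEnvIf line sets the environment variable "graber" when the request's
# User-Agent matches the given pattern (Teleport, w3m, Offline, Downloader, snake, Xenu);
# "Deny from env=graber" then refuses requests carrying that variable.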
SetEnvIf User-Agent ^Teleport graber   
SetEnvIf User-Agent ^w3m graber    
SetEnvIf User-Agent ^Offline graber   
SetEnvIf User-Agent Downloader graber  
SetEnvIf User-Agent snake graber  
SetEnvIf User-Agent Xenu graber   
Deny from env=graber