Can I block search crawlers for every site on an Apache web server?

Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverFlow.

Original URL: http://stackoverflow.com/questions/227101/
Asked by Nick Messick
I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really not like it if the staging sites get indexed.
Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?
Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either as there is a bunch of sites and I don't want to have to remember to copy over a certain setting if I make a new site.
Answered by jsdalton
Create a robots.txt file with the following contents:
User-agent: *
Disallow: /
Put that file somewhere on your staging server; your directory root is a great place for it (e.g. /var/www/html/robots.txt).
Add the following to your httpd.conf file:
# Exclude all robots
<Location "/robots.txt">
SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt
The SetHandler directive is probably not required, but it might be needed if you're using a handler like mod_python, for example.
That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.
(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)
Answered by Nick Messick
You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite the request to go to that.
This example would be suitable for protecting a single staging site, a bit of a simpler use case than what you are asking for, but this has worked reliably for me:
<IfModule mod_rewrite.c>
RewriteEngine on
# Dissuade web spiders from crawling the staging site
RewriteCond %{HTTP_HOST} ^staging\.example\.com$
RewriteRule ^robots.txt$ robots-staging.txt [L]
</IfModule>
You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.
Here's how you would do that:
<IfModule mod_rewrite.c>
RewriteEngine on
# Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
RewriteRule ^robots.txt$ http://www.example.com/robots-staging.txt [R]
</IfModule>
Answered by chazomaticus
To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.
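As a rough sketch of what that global config might look like (the realm name and the .htpasswd path below are placeholders, not anything from the original answer):

# Require a login for every request, across all virtual hosts
<Location "/">
    AuthType Basic
    AuthName "Staging server"
    AuthUserFile "/etc/apache2/.htpasswd"
    Require valid-user
</Location>

The password file itself can be created with the htpasswd utility, e.g. htpasswd -c /etc/apache2/.htpasswd someuser.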
Only downside to this is you now have to type in a username/password the first time you browse to any pages on the staging server.
Answered by ceejayoz
Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
Answered by Kevin
Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?)
If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.
Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtualhosts and point to a restrictive robots.txt
Answered by Greg
Try using Apache to stop bad robots. You can get the user agents online, or just allow browsers rather than trying to block all bots. One possible sketch of that approach is shown below.
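The user-agent patterns and the Apache 2.2-style access directives here are only example assumptions, not part of the original answer; the idea is to tag crawler requests with mod_setenvif and deny them server-wide:

# Flag requests from common search crawlers (example patterns only)
SetEnvIfNoCase User-Agent "googlebot|bingbot|slurp|baiduspider" is_crawler
<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=is_crawler
</Location>

On Apache 2.4 the equivalent would use Require directives instead (e.g. Require not env is_crawler inside a RequireAll block) rather than Order/Allow/Deny.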

