HTML: How do I prevent site scraping?
Disclaimer: This page is a translation of a popular Stack Overflow question. It is provided under the CC BY-SA 4.0 license; if you use or share it, you must follow the same license and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/3161548/
How do I prevent site scraping?
Asked by pixel
I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).
How can I prevent screen scraping? Is it even possible?
Answered by JonasCz - Reinstate Monica
Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work, and, by extension, what prevents them from working well.
There are various types of scrapers, and each works differently:
Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a request for a search, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
Embedding your website in other sites' pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy-paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scrapers, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast, or faster than a real user possibly could. (A sketch of a simple rate limiter follows this list.)
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; and you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.
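As a concrete illustration of the rate limiting idea, here is a minimal sketch, assuming a Python / Flask application; the framework, the thresholds and the hard 429 response (instead of a captcha) are all placeholder choices to adapt to your own stack:

import time
from collections import defaultdict, deque
from flask import Flask, request, abort

app = Flask(__name__)

WINDOW_SECONDS = 10        # look at the last 10 seconds of activity
MAX_REQUESTS = 20          # allow at most 20 requests per IP in that window
recent_requests = defaultdict(deque)   # ip -> timestamps of recent requests

@app.before_request
def rate_limit():
    ip = request.remote_addr
    now = time.time()
    window = recent_requests[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()               # drop timestamps that fell out of the window
    window.append(now)
    if len(window) > MAX_REQUESTS:
        abort(429)                     # better: show a captcha instead of a hard block

@app.route("/search")
def search():
    return "search results would go here"

In a real deployment you would key the limiter on more than the IP address (user agent, screen size reported by JavaScript, logged-in account), as described above.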
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
- If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
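As a sketch of this idea using Python's standard ipaddress module; the CIDR ranges below are made-up placeholders - in practice you would load the address ranges that the cloud providers themselves publish:

import ipaddress

# Placeholder ranges - replace with the published ranges of AWS, GCP, Azure, etc.
CLOUD_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_cloud_ip(ip_string):
    """True if the request originates from a known hosting provider range."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in network for network in CLOUD_RANGES)

if is_cloud_ip("203.0.113.42"):
    print("suspicious: request comes from a hosting provider - limit it or show a captcha")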
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
- Sorry, something went wrong. You can contact support via [email protected], should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own, use something like Google's reCaptcha: It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
Serve your text content as an image
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on-site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
- The bot / script does not want / need the full dataset anyway.
- Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
- There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
- Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
- You need search engines to find your content.
Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
- You can frequently change the ids and classes of elements in your HTML, perhaps even automatically (a sketch of one way to do this follows these points). So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
- If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server-side HTML processing, this should not be too hard.
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
Change your HTML based on the user's location
This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so !
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:
<div class="search-result">
<h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
<div class="the-real-search-result">
<h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>
<div class="search-result" style="display:none">
<h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
<p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
<a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
<div class="search-result" style="display:none">
<h3 class="search-result-title">This search result is here to prevent scraping</h3>
<p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
Note that clicking the link below will block access to this site for 24 hours.</p>
<a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.
- Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
- You can / should combine this with the previous tip of changing your HTML frequently.
- Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. Also consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
- Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
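A minimal sketch of that, assuming Flask; the is_probably_scraper check is a stand-in for whatever detection you actually use (rate limits, honeypot hits, user agent checks, ...):

import random
from flask import Flask, request, jsonify

app = Flask(__name__)

FAKE_TITLES = ["Breaking news", "Exclusive report", "Latest update"]
FAKE_SENTENCES = ["Nothing to see here.", "This article is entirely fabricated.",
                  "Generated filler text."]

def is_probably_scraper(req):
    return req.headers.get("User-Agent", "") == ""   # placeholder detection only

def fake_article(article_id):
    random.seed(article_id)   # same id -> same fake article, so it looks consistent
    return {"id": article_id,
            "title": random.choice(FAKE_TITLES),
            "body": " ".join(random.choices(FAKE_SENTENCES, k=5))}

@app.route("/article/<int:article_id>")
def article(article_id):
    if is_probably_scraper(request):
        return jsonify(fake_article(article_id))
    return jsonify({"id": article_id, "title": "Real article", "body": "..."})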
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
- "Mozilla" (Just that, nothing else. I've seen a few questions about scraping here, using that. A real browser will never use only that)
- "Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)
- "BIZCO EasyScraping Studio 2.0"
- "wget", "curl", "libcurl",.. (Wget and cURL are sometimes used for basic scraping)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
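A sketch combining this tip and the previous one (missing User Agent), assuming Flask; the blacklist simply reuses the example strings above and is not an authoritative list:

from flask import Flask, request, abort

app = Flask(__name__)

UA_BLACKLIST = ("wget", "curl", "libcurl", "java", "bizco easyscraping")

@app.before_request
def check_user_agent():
    ua = request.headers.get("User-Agent", "").strip()
    if not ua:
        abort(403)   # no User Agent at all: block, captcha, or serve fake data
    ua_lower = ua.lower()
    if ua_lower == "mozilla" or any(token in ua_lower for token in UA_BLACKLIST):
        abort(403)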
If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: when the user performs search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.
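A sketch of the search-cookie idea, assuming Flask; a real version would sign the cookie so it can't be forged, and keep the counters somewhere more durable than a module-level dict:

import uuid
from collections import Counter
from flask import Flask, request, make_response

app = Flask(__name__)
results_opened = Counter()        # search_id -> number of result pages opened

@app.route("/search")
def search():
    search_id = str(uuid.uuid4())
    response = make_response("search results page")
    response.set_cookie("search_id", search_id)
    return response

@app.route("/result/<int:result_id>")
def result(result_id):
    search_id = request.cookies.get("search_id")
    if search_id:
        results_opened[search_id] += 1
        if results_opened[search_id] > 30:
            pass   # opening (nearly) every result is scraper-like: captcha or rate limit here
    return f"result {result_id}"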
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or more complex), and then decode and display it on the client, after fetching via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm (a sketch of this idea follows the list below).
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using a HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
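As a sketch of the encoding-plus-session-key idea (base64 here is just the "something simple" example from above, not real protection), assuming Flask; the client-side JavaScript would decode the payload before displaying it:

import base64
import json
import secrets
from flask import Flask, request, abort

app = Flask(__name__)
valid_session_keys = set()

@app.route("/article-page")
def article_page():
    key = secrets.token_urlsafe(16)
    valid_session_keys.add(key)
    # the page embeds the key and fetches /data?key=... via Ajax
    return f'<script>const SESSION_KEY = "{key}";</script>'

@app.route("/data")
def data():
    if request.args.get("key") not in valid_session_keys:
        abort(403)   # the endpoint is useless without loading the page first
    payload = json.dumps({"artist": "Dummy Artist", "albums": 3})
    return base64.b64encode(payload.encode()).decode()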
There are several disadvantages to doing something like this, though:
It will be tedious and difficult to implement, maintain, and debug.
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping protection services, such as the anti-scraping by Cloudflare or Distill Networks (details on how it works here), which do these things, and more, for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
Further reading:
Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem - bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
Answered by Daniel Trebbien
I will presume that you have set up robots.txt.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
- Set up a page, /jail.html.
- Disallow access to the page in robots.txt (so the respectful spiders will never visit).
- Place a link on one of your pages, hiding it with CSS (display: none).
- Record IP addresses of visitors to /jail.html.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.
You might also want to make your /jail.html a whole entire website that has the same, exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
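A sketch of the /jail.html part in Python / Flask (the answer doesn't prescribe a language); the route simply records the visiting IP address for later review or blocking - and /jail.html must of course also be disallowed in robots.txt:

import datetime
from flask import Flask, request

app = Flask(__name__)

@app.route("/jail.html")
def jail():
    # respectful spiders never get here (robots.txt disallows it),
    # and humans never get here (the link pointing here is display: none)
    with open("jail_visitors.log", "a") as log:
        log.write(f"{datetime.datetime.now().isoformat()} {request.remote_addr}\n")
    return "Nothing to see here."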
Answered by Unicron
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this drove you into messing up your HTML code, dragging down SEO, validity, and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
Answered by ryeguy
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text not available at the time the page is loaded - make it with an image, Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Answered by Williham Totland
Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it; you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
Answered by Lizard
Sorry, it's really quite hard to do this...
I would suggest that you politely ask them to not use your content (if your content is copyrighted).
If it is and they don't take it down, then you can take further action and send them a cease and desist letter.
Generally, whatever you do to prevent scraping will probably end up with a more negative effect, e.g. accessibility, bots/spiders, etc.
Answered by Arshdeep
Okay, as all posts say, if you want to make it search engine-friendly then bots can scrape for sure.
But you can still do a few things, and it may be effective against 60-70% of scraping bots.
Make a checker script like below.
If a particular IP address is visiting very fast then after a few visits (5-10) put its IP address + browser information in a file or database.
The next step
(This would be a background process, running all the time or scheduled every few minutes.) Make another script that will keep on checking those suspicious IP addresses.
Case 1. If the User Agent is that of a known search engine like Google, Bing, or Yahoo (you can find more information on user agents by googling it), then check http://www.iplists.com/ and try to match the requesting IP address against that list. If it seems like a faked user agent, then ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IP addresses a bit more. I know this is achievable; also try a whois lookup of the IP address. It can be helpful.)
Case 2. No user agent of a search bot: Simply ask to fill in a CAPTCHA on the next visit.
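The answer describes the checker script but does not include it; here is a hypothetical reconstruction in Python (the thresholds, the file name and the ip_matches_official_bot_list helper are all placeholders - real search engine IPs have to be verified against the engines' published lists, e.g. via whois / reverse DNS, as the answer says):

import json
import time
from collections import defaultdict

KNOWN_BOT_UAS = ("googlebot", "bingbot", "slurp")   # user agents that claim to be search engines
visits = defaultdict(list)                          # ip -> recent visit timestamps
suspicious = {}                                     # ip -> browser (user agent) information

def record_visit(ip, user_agent):
    now = time.time()
    visits[ip] = [t for t in visits[ip] if now - t < 60] + [now]
    if len(visits[ip]) > 10:                        # "visiting very fast"
        suspicious[ip] = user_agent
        with open("suspicious_ips.json", "w") as f:
            json.dump(suspicious, f)

def ip_matches_official_bot_list(ip):
    return False    # placeholder: look the IP up in the search engine's published ranges

def needs_captcha(ip, user_agent):
    if ip not in suspicious:
        return False
    ua = user_agent.lower()
    if any(bot in ua for bot in KNOWN_BOT_UAS):
        # Case 1: claims to be a search engine - trust it only if the IP checks out
        return not ip_matches_official_bot_list(ip)
    # Case 2: not a search bot - ask for a CAPTCHA on the next visit
    return True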
Answered by jm666
Late answer - and also this answer probably isn't the one you want to hear...
I myself have already written many (many tens) of different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already a lot of advice in the other answers - now I will play the devil's advocate role, and will extend and/or correct its effectiveness.
First:
- if someone really wants your data
- you can't effectively (technically) hide your data
- if the data should be publicly accessible to your "regular users"
Trying to use some technical barriers isn't worth the trouble caused:
- to your regular users, by worsening their user experience
- to regular and welcomed bots (search engines)
- etc...
Plain HTML - the easiest way is to parse the plain HTML pages, with well defined structure and css classes. E.g. it is enough to inspect the element with Firebug, and use the right Xpaths and/or CSS path in my scraper.
You could generate the HTML structure dynamically, and you could also generate the CSS class names dynamically (and the CSS itself too) (e.g. by using some random class names) - but
- you want to present the information to your regular users in a consistent way
- e.g. again - it is enough to analyze the page structure once more to set up the scraper.
- and it can be done automatically by analyzing some "already known content"
- once someone already knows (from an earlier scrape), e.g.:
- which page contains the information about "phil collins"
- it is enough to display the "phil collins" page and (automatically) analyze how the page is structured "today" :)
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance), not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
Ajax - a little bit harder at the start, but many times it speeds up the scraping process :) - why?
When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and my Firefox is using it. Of course, because it is my own proxy - it is completely hidden - the target server sees it as a regular browser. (So, no X-Forwarded-for and such headers). Based on the proxy logs, it is mostly possible to determine the "logic" of the ajax requests, e.g. I could skip most of the html scraping, and just use the well-structured ajax responses (mostly in JSON format).
So, the ajax doesn't help much...
Somewhat more complicated are pages which use heavily packed javascript functions.
Here it is possible to use two basic methods:
- unpack and understand the JS and create a scraper which follows the Javascript logic (the hard way)
- or (the one I prefer to use myself) - just using Mozilla with Mozrepl for scraping. E.g. the real scraping is done in a full-featured javascript-enabled browser, which is programmed to click the right elements and just grab the "decoded" responses directly from the browser window.
Such scraping is slow (the scraping is done as in a regular browser), but it is
- very easy to set up and use
- and it is nearly impossible to counter it :)
- and the "slowness" is needed anyway to counter the "blocking of rapid same-IP-based requests"
User-Agent based filtering doesn't help at all. Any serious data-miner will set it to some correct one in his scraper.
Require login - doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just logging into the site as a regular user, using Mozilla, and then just running the Mozrepl-based scraper...
Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. They just register themselves on your site as a regular user.
Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply more HTML/Javascript pages that need to be analyzed... If the data is worth the trouble - the data-miner will do the required analysis.
IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).
It is very hard to scrape data hidden in images (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times - but honestly - the data must be worth the trouble for the scraper (and many times it isn't).
On the other side, your users will hate you for this. Myself, (even when not scraping) I hate websites which don't allow copying the page content to the clipboard (because the information is in the images, or (the silly ones) try to bind some custom Javascript event to the right click). :)
The hardest are the sites which use java applets or flash, and the applet uses secure https requests itself internally. But think twice - how happy will your iPhone users be... ;). Therefore, currently very few sites use them. Myself, I block all flash content in my browser (in regular browsing sessions) - and I never use sites which depend on Flash.
Your milestones could be..., so you can try this method - just remember - you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)
Captcha (the good ones - like reCaptcha) helps a lot - but your users will hate you... - just imagine how your users will love you when they need to solve some captchas on all pages showing information about the music artists.
Probably no need to continue - you already get the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.
So,
- make your data easily accessible - by some API
- this allows easy data access
- e.g. it offloads your server from scraping - good for you
- set up the right usage rights (for example, the source must be cited)
- remember, a lot of data isn't copyright-able - and it's hard to protect it
- add some fake data (as you have already done) and use legal tools
- as others already said, send a "cease and desist letter"
- other legal actions (suing and the like) are probably too costly and hard to win (especially against non-US sites)
Think twice before you try to use some technical barriers.
Rather than trying to block the data-miners, just put more effort into your website's usability. Your users will love you. The time (& energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...
Also, data-thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add a warning "this house is connected to the police" - many thieves will not even try to break in. Because one wrong move by them - and they are going to jail...
So, you invest only a few bucks, but the thief invests and risks much.
But the data-thief doesn't have such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some bug as a result of the technical barriers), you will lose your users. If the scraping bot does not work the first time, nothing happens - the data-miner will just try another approach and/or debug the script.
In this case, you need to invest much more - and the scraper invests much less.
Just think about where you want to invest your time & energy...
PS: English isn't my native language - so forgive my broken English...
Answered by STW
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make the site scraping-friendly. If a search engine can crawl and scrape your site then a malicious scraper can as well. It's a fine line to walk.
Answered by dengeltrees
From a tech perspective: Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective: It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable like names and basic stats... you should just let it go in the name of free speech and open data. That is, what the web's all about.