javascript: Browser-based client-side scraping

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/31581051/

Date: 2020-10-28 14:00:46  Source: igfitidea

Browser-based client-side scraping

Tags: javascript, php, jquery, web-scraping, phantomjs

Asked by eozzy

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-commerce site, but several requests from my server would get me banned, so I'm looking for a way to do client-side scraping: that is, request pages from the user's IP and send them to the server for processing.

Answered by Johann Bauer

No, you won't be able to use your clients' browsers to scrape content from other websites with JavaScript, because of a security measure called the same-origin policy.

There should be no way to circumvent this policy, and that's for a good reason. Imagine you could instruct your visitors' browsers to do anything on any website. That's not something you want to happen automatically.

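As a minimal sketch of the rule being enforced (the URLs below are hypothetical): the browser compares the *origin* of the two documents, i.e. scheme + host + port, and only a full match lets one read the other's contents.

```javascript
// The same-origin policy compares scheme + host + port. A page may only
// read another document's contents (an iframe's DOM, or a fetch response
// without CORS) when both share an origin. Hypothetical URLs:
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

sameOrigin('https://compare.example/app', 'https://compare.example/page'); // → true
sameOrigin('https://compare.example/app', 'https://shop.example/item/1');  // → false
```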
However, you could create a browser extension to do that. JavaScript browser extensions can be granted more privileges than regular page JavaScript.

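A rough sketch of the shape such an extension could take (Manifest V3; the extension name, target host, and ingest URL are all hypothetical, not from the original answer):

```javascript
// background.js of a hypothetical Chrome extension. Its manifest.json
// must request access to the target hosts, e.g.:
//
//   {
//     "manifest_version": 3,
//     "name": "price-helper", "version": "1.0",
//     "host_permissions": ["https://shop.example/*"],
//     "background": { "service_worker": "background.js" }
//   }
//
// With host_permissions granted, the extension may fetch cross-origin
// pages from the user's browser, and therefore from the user's IP.
async function scrapeAndForward(productUrl) {
  const res = await fetch(productUrl);             // allowed by host_permissions
  const html = await res.text();
  await fetch('https://compare.example/ingest', {  // your own server
    method: 'POST',
    headers: { 'Content-Type': 'text/html' },
    body: html,
  });
}
```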
Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your users' IP addresses. Then again, you probably don't want to do that, as Java plugins are considered insecure (and slow to load!), and not all users will even have one installed.

So now back to your problem:

I need to scrape pages of an e-com site but several requests from the server would get me banned.

If the owner of that website doesn't want you to use his service in that way, you probably shouldn't do it. Otherwise you would risk legal implications (look here for details).

If you are on the "dark side of the law" and don't care whether it's illegal, you could use something like http://luminati.io/ to use the IP addresses of real people.

Answered by Flavien Volken

Basically, browsers are designed to prevent you from doing this…

The solution everyone thinks of first:

jQuery/JavaScript: accessing contents of an iframe

But it will not work in most cases with "recent" browsers (less than 10 years old).

Alternatives are:

  • Using the official APIs of the server (if any)
  • Trying to find out whether the server provides a JSONP service (good luck)
  • If you are on the same domain, trying cross-site scripting (if possible; not very ethical)
  • Using a trusted relay or proxy (but this will still use your own IP)
  • Pretending you are a Google web crawler (why not, but not very reliable and with no guarantees)
  • Using a hack to set up the relay/proxy on the client itself; I can think of Java or possibly Flash (will not work on most mobile devices, is slow, and Flash has its own cross-site limitations too)
  • Asking Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
  • Just doing the job yourself and caching the answers, in order to offload their server and decrease the risk of being banned
  • Indexing the site yourself (with your own web crawler), then using your own index (depends on how often the source changes): http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch

[EDIT]

One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / public proxy as a bridge to retrieve the information for you. Here is a simple example of doing so. In short, you get cross-domain GET requests.

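For context: YQL let you hand Yahoo's servers a query such as `select * from html where url="…"` and read the result back as JSON, which made it a popular same-origin-policy bridge. Yahoo has since retired the service, so the endpoint below no longer works; the sketch only illustrates the pattern.

```javascript
// Builds the request URL the (now retired) public YQL endpoint used to
// accept. The query asked YQL to fetch the page server-side and return
// its HTML as JSON, which any origin was allowed to read.
function yqlUrl(pageUrl) {
  const query = `select * from html where url="${pageUrl}"`;
  return 'https://query.yahooapis.com/v1/public/yql?format=json&q=' +
         encodeURIComponent(query);
}
```

Public CORS proxies work on the same principle: a third party performs the request server-side and returns a response your page is permitted to read.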
Answered by Jan

Have a look at http://import.io; they provide a couple of crawlers, connectors and extractors. I'm not quite sure how they get around bans, but somehow they do (we have been using their system for over a year now with no problems).

Answered by user2816491

You could build a browser extension with artoo.

http://medialab.github.io/artoo/chrome/

That would allow you to get around the same-origin policy restrictions. It is all JavaScript and runs on the client side.
