Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6508393/

Time: 2020-08-23 22:06:45  Source: igfitidea

Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

Tags: javascript, google-chrome, google-chrome-extension, xmlhttprequest, web-scraping

Asked by Seb Nilsson

What are the best options for scraping a web page that is not currently open in a tab, from within a Google Chrome Extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.

The important thing is to mask the scraping so that it behaves like a normal web request: no indications of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing stopping a solution, disregard the bonus points.

Answered by Eli Grey

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
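
The approach above might look roughly like the following sketch. The name fetchDocument and the injectable deps parameter are illustrative assumptions, not part of the answer or the linked gist, and the gist's up-front feature detection is replaced here by a simpler check of responseType after the request completes.

```javascript
// Sketch: request a page as a parsed document, falling back to DOMParser.
// fetchDocument and deps are hypothetical names used only for illustration.
function fetchDocument(url, onDone, onError, deps) {
  deps = deps || {};
  var XHR = deps.XMLHttpRequest || XMLHttpRequest;
  var Parser = deps.DOMParser || DOMParser;

  var xhr = new XHR();
  xhr.open("GET", url, true);
  try {
    // XHR2 feature; engines without support ignore or reject the value.
    xhr.responseType = "document";
  } catch (e) {
    // Fall through to the text-based fallback in onload.
  }

  xhr.onload = function () {
    if (xhr.responseType === "document") {
      if (xhr.response) {
        onDone(xhr.response);
      } else {
        onError(new Error("Response could not be parsed as a document"));
      }
      return;
    }
    // Fallback: parse the raw text. DOMParser expects a bare MIME type,
    // so any "; charset=..." parameter is stripped first.
    var header = xhr.getResponseHeader("Content-Type") || "text/html";
    var mimeType = header.split(";")[0].trim();
    onDone(new Parser().parseFromString(xhr.responseText, mimeType));
  };
  xhr.onerror = function () {
    onError(new Error("Request failed: " + url));
  };
  xhr.send();
}
```

The callbacks receive a Document, so the question's placeholder could become something like fetchDocument(url, function (doc) { var items = $(doc).find('.item'); }, console.error).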

Use the Chrome WebRequest API to hide X-Requested-With and similar headers.
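
As a rough sketch of that idea, a background page could filter the outgoing headers in an onBeforeSendHeaders listener. This assumes a Manifest V2 style extension with the webRequest and webRequestBlocking permissions plus host permissions for the target site; the URL pattern and the stripHeaders helper are placeholders invented for illustration. Newer Chrome versions may additionally require "extraHeaders" in the last argument for some headers such as Origin, and Manifest V3 restricts blocking listeners, so treat this as a starting point only.

```javascript
// Header names to hide, compared case-insensitively.
var HIDDEN_HEADERS = ["x-requested-with", "origin"];

// Pure helper (hypothetical name): copy of the header list without hidden names.
function stripHeaders(requestHeaders, hiddenNames) {
  return requestHeaders.filter(function (header) {
    return hiddenNames.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Register the listener only when the webRequest API is actually available.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      return {
        requestHeaders: stripHeaders(details.requestHeaders, HIDDEN_HEADERS)
      };
    },
    { urls: ["http://example.com/*"] }, // placeholder URL pattern
    ["blocking", "requestHeaders"]
  );
}
```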

Answered by Anshul

If you are open to looking beyond a Google Chrome plugin, look at PhantomJS, which uses Qt-WebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser: it doesn't display output on a screen and can quietly work in the background while you are doing other things. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking buttons and so on, much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. Because it uses WebKit, its rendering behaviour closely matches Google Chrome's.
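
A minimal PhantomJS script along those lines might look like the sketch below. The scrapeItems helper, the .item selector (borrowed from the question) and the example.com URL are assumptions for illustration; the script is meant to be run with the phantomjs binary, not inside an extension.

```javascript
// Sketch: open a page headlessly and pull out the text of matching elements.
// scrapeItems is a hypothetical helper; page is a PhantomJS webpage object.
function scrapeItems(page, url, done) {
  page.open(url, function (status) {
    if (status !== "success") {
      done(new Error("Could not load " + url));
      return;
    }
    // The function passed to evaluate runs inside the loaded page,
    // so it can use the page's DOM but not variables from this script.
    var texts = page.evaluate(function () {
      var nodes = document.querySelectorAll(".item");
      return Array.prototype.map.call(nodes, function (node) {
        return node.textContent;
      });
    });
    done(null, texts);
  });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== "undefined") {
  var webpage = require("webpage").create();
  scrapeItems(webpage, "http://example.com/", function (err, items) {
    console.log(err ? err.message : JSON.stringify(items));
    phantom.exit(err ? 1 : 0);
  });
}
```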

Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.

Answered by potar

A lot of tools have been released since this question was asked.

artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.

Answered by Novikov

Web scraping is kind of convoluted in a Chrome Extension. Some points:

  • You run content scripts for access to the DOM.
  • Background pages (one per browser) can send and receive messages to content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as a response.
  • You can execute content scripts in all frames of a webpage, then stitch the document tree (composed of the 1..N frames that the page contains) together.
  • As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
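
The message-passing point in the list above could be sketched as follows. The message shape ({ type: "scrape", selector: ... }) and the helper names are assumptions made up for this example; in a real extension the two halves live in separate files (content script and background page), and they are shown together here only for brevity.

```javascript
// --- Content script side ---
// Hypothetical handler: answer "scrape" messages with the text of matching nodes.
function handleScrapeRequest(message, doc) {
  if (!message || message.type !== "scrape") {
    return null; // not a message this endpoint understands
  }
  var nodes = doc.querySelectorAll(message.selector);
  return Array.prototype.map.call(nodes, function (node) {
    return node.textContent;
  });
}

// Register the endpoint only when the extension messaging API is present.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    var items = handleScrapeRequest(message, document);
    if (items !== null) {
      sendResponse({ items: items });
    }
  });
}

// --- Background page side ---
// Ask the content script in a given tab to scrape; the callback receives
// the { items: [...] } object passed to sendResponse above.
function requestScrape(tabId, selector, callback) {
  chrome.tabs.sendMessage(tabId, { type: "scrape", selector: selector }, callback);
}
```

Because the content script runs in an open tab, this pattern covers pages that are loaded in the browser; it does not by itself fetch a page that is not open.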

Answered by Steve

I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.

The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
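
On the extension side, calling such a server-side script might look like the sketch below. The endpoint URL and its url query parameter are purely hypothetical, since the answer does not specify how the PHP script would accept the target address.

```javascript
// Build the request URL for a hypothetical proxy script that takes ?url=...
function buildProxyUrl(proxyEndpoint, targetUrl) {
  return proxyEndpoint + "?url=" + encodeURIComponent(targetUrl);
}

// Fetch the target page's HTML through the proxy as a string.
function fetchViaProxy(proxyEndpoint, targetUrl, onDone, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", buildProxyUrl(proxyEndpoint, targetUrl), true);
  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      onDone(xhr.responseText);
    } else {
      onError(new Error("Proxy returned HTTP " + xhr.status));
    }
  };
  xhr.onerror = function () {
    onError(new Error("Could not reach proxy"));
  };
  xhr.send();
}

// Usage (placeholder endpoint):
// fetchViaProxy("https://example.com/fetch.php", "http://example.org/page",
//   function (html) { console.log(html.length); }, console.error);
```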

Answered by Dmitry Chichkov

I think you can start from this example.

So basically you can try using an Extension + Plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.

I can recommend using Firebreath as a cross-platform Chrome/Firefox plugin platform; in particular, take a look at this example: Firebreath - Making HTTP Requests with SimpleStreamsHelper

Answered by tim

Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?
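
A sketch of that idea follows, with one important caveat: a frame's contentDocument is only readable when the frame is same-origin with the page hosting it, and many sites refuse to be framed at all (for example via X-Frame-Options). Inside an extension you would likely need a content script injected into the frame instead. The loadInFrame helper is a hypothetical name.

```javascript
// Load a URL into a hidden iframe and hand its document to a callback.
function loadInFrame(doc, url, onLoad, onError) {
  var frame = doc.createElement("iframe");
  frame.style.display = "none";
  frame.onload = function () {
    var inner = null;
    try {
      inner = frame.contentDocument; // null or throws for cross-origin frames
    } catch (e) {
      inner = null;
    }
    if (inner) {
      onLoad(inner, frame);
    } else {
      onError(new Error("Frame document is not accessible (cross-origin?)"));
    }
  };
  frame.src = url;
  doc.body.appendChild(frame);
  return frame;
}

// Usage with the question's jQuery selection:
// loadInFrame(document, url, function (inner) {
//   var items = $(inner).find('.item');
// }, console.error);
```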
