JavaScript: Scrape web pages in real time with Node.js

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/5211486/

Date: 2020-08-23 16:11:43 | Source: igfitidea


javascript · jquery · node.js · screen-scraping · web-scraping

Asked by Avishai

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results scraped, and returned to the client as they become available.


Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.


A few starting points:


Using node.js and jquery to scrape websites


Anybody have any ideas?

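Since the question is about fanning one query out to several sites and returning JSON as results come back, here is a minimal, dependency-free sketch of that dispatch pattern (the `sites` list and `fakeFetch` are stand-ins, not real scrapers):

```javascript
// Minimal sketch of kayak.com-style fan-out: one query, several sites,
// partial failures tolerated. `sites` and `fakeFetch` are placeholders
// for real scrapers.
const sites = ['siteA', 'siteB', 'siteC'];

function fakeFetch(site, query) {
  // A real implementation would issue an HTTP request and parse the HTML.
  return Promise.resolve({ site: site, results: [site + ':' + query] });
}

function search(query) {
  // Dispatch to every site at once; allSettled keeps one failed scrape
  // from sinking the whole search.
  return Promise.allSettled(sites.map(function (s) { return fakeFetch(s, query); }))
    .then(function (outcomes) {
      return outcomes
        .filter(function (o) { return o.status === 'fulfilled'; })
        .map(function (o) { return o.value; });
    });
}

search('nyc-to-sfo').then(function (all) {
  console.log(JSON.stringify(all)); // JSON payload ready for the client
});
```

Swapping `fakeFetch` for a real HTTP request plus an HTML parser gives the aggregator shape the question describes.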

Answered by Avishai

Node.io seems to take the cake :-)


Answered by Yevgeniy

All aforementioned solutions presume running the scraper locally. This means you will be severely limited in performance (due to running them in sequence or in a limited set of threads). A better approach, imho, is to rely on an existing, albeit commercial, scraping grid.


Here is an example:

下面是一个例子:

var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  queries:  ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
  var scraped_urls = Object.keys(scraped_data);
  scraped_urls.forEach(function (url) {
    console.log("Results from " + url + ": " + scraped_data[url]);
  });
});

Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).

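The "collect results as they become available" option can be sketched generically; this is not Bobik's actual API, just the callback pattern that option describes. Each job reports through `onResult` the moment it settles, and `onDone` fires once everything has finished:

```javascript
// Generic "stream results as they arrive" pattern (hypothetical helper,
// not Bobik's real API). scrapeOne is any function returning a promise.
function scrapeAll(urls, scrapeOne, onResult, onDone) {
  var pending = urls.length;
  urls.forEach(function (url) {
    scrapeOne(url)
      .then(function (data) { onResult(null, url, data); })
      .catch(function (err) { onResult(err, url, null); })
      .finally(function () { if (--pending === 0) onDone(); });
  });
}

// Usage with a stand-in scraper:
scrapeAll(
  ['http://example.com/a', 'http://example.com/b'],
  function (url) { return Promise.resolve('<title>' + url + '</title>'); },
  function (err, url, html) { if (!err) console.log('got', url); },
  function () { console.log('all done'); }
);
```

The key design point is that slow sites never block fast ones: each callback fires independently, which is what makes the kayak.com-style experience possible.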

You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk


Answered by electblake

I've been doing research myself, and https://npmjs.org/package/wscraper boasts itself as


a web scraper agent based on cheerio.js, a fast, flexible, and lean implementation of core jQuery; built on top of request.js; inspired by http-agent.js


Very low usage (according to npmjs.org) but worth a look for any interested parties.


Answered by daithi44

You don't always need jQuery. If you play with the DOM returned from jsdom, for example, you can easily take what you need yourself (also considering you don't have to worry about cross-browser issues). See: https://gist.github.com/1335009 That's not taking away from node.io at all, just saying you might be able to do it yourself depending...


Answered by Evan Carroll

The new way using ES7/promises


Usually when you're scraping you want to use some method to


  1. Get the resource on the webserver (usually an HTML document)
  2. Read that resource and work with it as
    1. A DOM/tree structure that you can navigate
    2. A token stream, parsed with something like SAX.
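To make the two styles concrete, here is a toy contrast on a hardcoded document (an assumed input; real scrapers would fetch the page and use a proper parser such as cheerio rather than the illustrative regexes below):

```javascript
// Toy contrast of tree vs. token parsing on a hardcoded page. The
// regexes here are illustrations only; real HTML needs a real parser.
var html = '<html><head><title>Fares</title></head>' +
           '<body><p>NYC</p><p>SFO</p></body></html>';

// 1. Token style (SAX-like): treat the markup as a flat stream of
//    tags and text, never building a tree.
var tokens = html.split(/(<[^>]+>)/).filter(function (t) { return t.length > 0; });

// 2. Tree style: hand the document to a DOM builder (cheerio, jsdom)
//    and query the structure; a naive regex stands in for that here.
var title = (html.match(/<title>([^<]*)<\/title>/) || [])[1];

console.log(tokens.length + ' tokens; title: ' + title);
```

Token parsing keeps memory flat on huge documents; tree parsing lets you query with selectors, which is why the answer below reaches for it.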

Both tree and token parsing have advantages, but tree is usually substantially simpler. We'll do that. Check out request-promise; here is how it works:


const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js 

const options = {
    uri: 'http://www.google.com',
    transform: function (body) {
        return cheerio.load(body);
    }
};

rp(options)
    .then(function ($) {
        // Process html like you would with jQuery... 
    })
    .catch(function (err) {
        // Crawling failed or Cheerio choked...
    });

This is using cheerio, which is essentially a lightweight server-side jQuery-esque library (that doesn't need a window object, or jsdom).


Because you're using promises, you can also write this in an asynchronous function. It'll look synchronous, but it'll be asynchronous with ES7:


async function parseDocument() {
    let $;
    try {
      $ = await rp(options);
    } catch (err) { console.error(err); }

    console.log( $('title').text() ); // prints just the text in the <title>
}

Answered by browserless

I see most answers on the right path with cheerio and so forth, however once you get to the point where you need to parse and execute JavaScript (a la SPAs and more), then I'd check out https://github.com/joelgriffith/navalia (I'm the author). Navalia is built to support scraping in a headless-browser context, and it's pretty quick. Thanks!


Answered by harish2704

It is my easy-to-use, general-purpose scraper, https://github.com/harish2704/html-scrapper, written for Node.js. It can extract information based on predefined schemas. A schema definition includes a CSS selector and a data extraction function. It currently uses cheerio for DOM parsing.


Answered by user3723412

Check out https://github.com/rc0x03/node-promise-parser


Fast: uses libxml C bindings
Lightweight: no dependencies like jQuery, cheerio, or jsdom
Clean: promise-based interface, no more nested callbacks
Flexible: supports both CSS and XPath selectors