JavaScript: How can I scrape pages with dynamic content using node.js?

Question

Asked by JayD

I am trying to scrape a website, but I don't get some of the elements because they are created dynamically.

I use cheerio in node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code returns an empty response because, when the page is loaded, the <ul id="store_list" class="listMain"> is empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?


Answer 1

Answered by Safi

Here you go:

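var phantom = require('phantom');

// Note: this is the callback-style API of early versions of the phantom package.
phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function () {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        // evaluate() runs inside the page, so collect the links there and return them
        page.evaluate(function () {
          var links = [];
          $('.listMain > li').each(function () {
            links.push($(this).find('a').attr('href'));
          });
          return links;
        }, function (links) {
          // Back in Node: print the results, then shut PhantomJS down
          links.forEach(function (href) {
            console.log(href);
          });
          ph.exit();
        });
      });
    });
  });
});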

Answer 2

Answered by scniro

Check out GoogleChrome/puppeteer


Headless Chrome Node API


It makes scraping pretty trivial. The following example scrapes the headline over at npmjs.com (assuming .npm-expansions remains):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('.npm-expansions').textContent
  });

  console.log(textContent); /* No Problem Mate */

  await browser.close();
})();

evaluate allows you to inspect the dynamic element, because it runs your script in the context of the page.
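For the list in the question, which is filled in after the initial load, it may help to wait for the items before reading them. The following is a minimal sketch, not part of the original answer: the helper name collectLinks and the 10-second timeout are assumptions, and it relies on Puppeteer's page.waitForSelector and page.$$eval.

```javascript
// Sketch: wait for dynamically inserted list items, then collect their links.
// collectLinks and the 10000 ms timeout are illustrative assumptions.
async function collectLinks(page, url, selector) {
  await page.goto(url);
  // Block until at least one matching element has been added to the DOM
  await page.waitForSelector(selector, { timeout: 10000 });
  // Runs in the page: map each matched element to the href of its first link
  return page.$$eval(selector, items =>
    items.map(item => {
      const anchor = item.querySelector('a');
      return anchor ? anchor.getAttribute('href') : null;
    })
  );
}

// Run as: node scrape.js "http://www.bdtong.co.kr/index.php?c_category=C02"
const startUrl = process.argv[2];
if (startUrl) {
  (async () => {
    const puppeteer = require('puppeteer'); // loaded only when actually scraping
    const browser = await puppeteer.launch();
    try {
      const page = await browser.newPage();
      const links = await collectLinks(page, startUrl, '.listMain > li');
      links.forEach(href => console.log(href));
    } finally {
      await browser.close();
    }
  })().catch(err => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

If the selector never appears, waitForSelector rejects after the timeout, so an empty list is reported as an error instead of passing silently.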

Answer 3

Answered by Keng

Use the new npm module x-ray with its pluggable web driver, x-ray-phantom.

There are examples on those projects' pages, but here's how to do dynamic scraping:

var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

// Route x-ray's requests through PhantomJS so page scripts run before scraping
var x = Xray()
  .driver(phantom());

x('http://google.com', 'title')(function (err, str) {
  if (err) return console.error(err);
  console.log(str); // text of the page's <title> element
});

Answer 4

Answered by Rohit Parte

The easiest and most reliable solution is to use puppeteer, as described in https://pusher.com/tutorials/web-scraper-node. It is suitable for both static and dynamic scraping.

You only need to change the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.
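As a possible alternative to editing Puppeteer's own files, the timeout can also be raised from your script. This is only a sketch under assumptions: the helper name openWithTimeout and the 120000 ms value are made up for illustration, while page.setDefaultTimeout and the waitUntil option of page.goto are part of Puppeteer's API.

```javascript
// Sketch: raise Puppeteer's timeout through its API instead of editing library files.
// openWithTimeout and the 120000 ms value used below are illustrative assumptions.
async function openWithTimeout(browser, url, timeoutMs) {
  const page = await browser.newPage();
  // Default maximum wait for navigation and waitFor* calls on this page
  page.setDefaultTimeout(timeoutMs);
  // Resolve once the network has been nearly idle for a short while
  await page.goto(url, { waitUntil: 'networkidle2' });
  return page;
}

// Run as: node open.js "https://www.npmjs.com/"
const targetUrl = process.argv[2];
if (targetUrl) {
  (async () => {
    const puppeteer = require('puppeteer'); // loaded only when actually scraping
    const browser = await puppeteer.launch();
    try {
      const page = await openWithTimeout(browser, targetUrl, 120000);
      console.log(await page.title());
    } finally {
      await browser.close();
    }
  })().catch(err => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Setting the timeout per page keeps the change in your own code, so it survives reinstalling or upgrading the package.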
