JavaScript: How to most efficiently parse a web page using Node.js

Question

Asked by NiLL

I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of current tracks and build my own HTML5 app for listening on mobile devices.

Answer 1

Answered by JP Richardson

I have done this a lot. You'll want to use PhantomJS if the website that you're scraping relies heavily on JavaScript. Note that PhantomJS is not Node.js; it's a completely different JavaScript runtime. You can integrate the two through phantomjs-node or node-phantom, but both are kinda hacky, so YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches, and that includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: "Quick and Dirty Screen Scraping with Node.js". But, again, if the site is JavaScript intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request');
var cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function (err, resp, body) {
  // Bail out on network errors instead of parsing an undefined body.
  if (err) {
    console.error(err);
    return;
  }
  var $ = cheerio.load(body); // parse the HTML into a jQuery-like object
  var links = $('.sb_tlst h3 a'); // use your CSS selector here
  links.each(function (i, link) {
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});

Answer 2

Answered by jabclab

You could try PhantomJS. Here's the documentation for using it for screen scraping.

Answer 3

Answered by Max Heiber

I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.

Answer 4

Answered by Mustafa

If you want to go with PhantomJS, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go with PhantomJS, because it does more than you usually need, and Cheerio is faster.
