JavaScript: How to most efficiently parse a web page using Node.js
Original question: http://stackoverflow.com/questions/12403833/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow.
How to most efficiently parse a web page using Node.js
Asked by NiLL
I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.
P.S. This is the site I'm parsing. I want to get a list of the current tracks and make my own HTML5 app for listening on mobile devices.
Answered by JP Richardson
I have done this a lot. You'll want to use PhantomJS if the website that you're scraping relies heavily on JavaScript. Note that PhantomJS is not Node.js; it's a completely different JavaScript runtime. You can integrate the two through phantomjs-node or node-phantom, but both are kinda hacky, so YMMV with those. Avoid anything to do with jsdom. It'll cause you headaches, and this includes Zombie.js.
What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.
I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js. But, again, if the site is JavaScript-intensive, use PhantomJS in conjunction with CasperJS.
Hope this helps.
Snippet using Request and Cheerio:
var request = require('request');
var cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

// Fetch the page, then query the returned HTML with Cheerio.
request(url, function (err, resp, body) {
  if (err) {
    console.error('Request failed:', err);
    return;
  }
  var $ = cheerio.load(body); // declared with var to avoid leaking a global
  var links = $('.sb_tlst h3 a'); // use your CSS selector here
  links.each(function (i, link) {
    console.log($(link).text() + ':\n ' + $(link).attr('href'));
  });
});
Answered by jabclab
You could try PhantomJS. Here's the documentation for using it for screen scraping.
Answered by Max Heiber
I agree with @JP Richardson that Cheerio is best for scraping non-JS-heavy sites. For JS-heavy sites, use Casper. It provides great abstractions over Phantom and a promises-style API. They go over how to scrape in their docs: http://docs.casperjs.org/en/latest/quickstart.html.
Answered by Mustafa
If you want to go with Phantom, use node-phantom. I have a GitHub repository that uses them together to generate PDF files from HTML, if you want to have a look. But I wouldn't go with Phantom, because it does more than you usually need and Cheerio is faster.