Node.js: scrape a webpage and navigate by clicking buttons
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, include the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/18160635/
Scrape a webpage and navigate by clicking buttons
Asked by user2129794
I want to perform following actions at the server side:
1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client via JSON or something
I am thinking of using it with Node.js.
But I am confused as to which module I should use:
a) Zombie
b) Node.io
c) Phantomjs
d) JSDOM
e) Anything else
I have installed node.io but am not able to run it via the command prompt.
PS: I am working on Windows Server 2008
Answered by danielepolencic
Zombie.js and Node.io run on JSDOM, so your options are JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.
- JSDOM is fairly slow because it has to recreate the DOM and CSSOM in Node.js.
- PhantomJS/SlimerJS are proper headless browsers, so performance is OK and they are also very reliable.
- Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the DOM - no JavaScript is executed). Therefore you can't really click on buttons/links, but it's very fast for scraping webpages.
Given your requirements, I'd probably go with something like a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the DOM or CSS like JSDOM does) and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();

casper
    .then(function(){
        console.log("Start:");
    })
    .thenOpen("https://www.domain.com/page1")
    .then(function(){
        // scrape something
        this.echo(this.getHTML('h1#foobar'));
    })
    .thenClick("#button1")
    .then(function(){
        // scrape something else
        this.echo(this.getHTML('h2#foobar'));
    })
    .thenClick("#button2")
    .thenOpen("http://myserver.com", {
        method: "post",
        data: {
            my: 'data'
        }
    }, function() {
        this.echo("data sent back to the server");
    });

casper.run();
Answered by Thomas Dondorf
Short answer (in 2019): Use puppeteer
If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing JavaScript inside the page), you should check out jsdom and cheerio.
Explanation
Tools like jsdom (or cheerio) allow you to extract information from an HTML document by parsing it. This is fast and works well as long as the website does not contain JavaScript. It will be very hard or even impossible to extract information from a website built on JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and possibly crash your application. To quote the docs:
However, this is also highly dangerous when dealing with untrusted content.
Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement). This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.
To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.
Code sample
The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto). After that, functions like page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    // example: get innerHTML of an element
    const someContent = await page.$eval('#selector', el => el.innerHTML);

    // Use Promise.all to wait for two actions (navigation and click)
    await Promise.all([
        page.waitForNavigation(), // wait for navigation to happen
        page.click('a.some-link'), // click link to cause navigation
    ]);

    // another example, this time using the evaluate function to return innerText of body
    const moreContent = await page.evaluate(() => document.body.innerText);

    // click another button
    await page.click('#button');

    // close browser when we are done
    await browser.close();
})();
Answered by user568109
The modules you listed do the following:
- Phantomjs/Zombie - simulate a browser (headless - nothing is actually displayed). Can be used for scraping static or dynamic pages, or for testing your HTML pages.
- Node.io/jsdom - web scraping: extracting data from a page (static).
Looking at your requirements, you could use phantom or zombie.

