node.js: How to manage a 'pool' of PhantomJS instances
Original URL: http://stackoverflow.com/questions/9961254/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Trindaz
I'm planning a webservice for my own internal use that takes one argument, a URL, and returns HTML representing the resolved DOM from that URL. By resolved I mean that the webservice will first get the page at that URL, then use PhantomJS to 'render' the page, and then return the resulting source after all DHTML, AJAX calls etc. are executed. However, launching phantom on a per-request basis (which I'm doing now) is way too sluggish. I would rather have a pool of PhantomJS instances with one always available to serve the latest call to my webservice.
Has any work been done on this kind of thing before? I'd rather base this webservice on the work of others than write a pool manager / http proxy server for myself from scratch.
More Context: I've listed the 2 similar projects that I've seen so far below and why I've avoided each one, resulting in this question about managing a pool of PhantomJS instances instead.
jsdom - from what I've seen it has great functionality for executing scripts on a page, but it doesn't attempt to replicate browser behaviour, so if I were to use it as a general purpose "DOM resolver" there'd end up being a lot of extra coding to handle all kinds of edge cases, event calling, etc. The first example I saw was having to manually call the onload() function of the body tag for a test app I set up using node. It seemed like the beginning of a deep rabbit hole.
Selenium - It just has so many more moving parts, so setting up a pool to manage long-lived browser instances will just be more complicated than using PhantomJS. I don't need any of its macro recording / scripting benefits. I just want a webservice that is as performant at getting a webpage and resolving its DOM as if I were browsing to that URL with a browser (or even faster if I can make it ignore images etc.)
Answered by JasonS
I set up a PhantomJs cloud service, and it pretty much does what you are asking. It took me about 5 weeks of work to implement.
The biggest problem you'll run into is the known issue of memory leaks in PhantomJs. The way I worked around this is to recycle (kill and restart) my instances every 50 calls.
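The recycling workaround can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer author's implementation: `RecyclingWorker` and `createWorker` are hypothetical names, and the worker object is assumed to expose `render(url)` and `kill()`.

```javascript
// Hypothetical sketch: wrap a worker factory and replace the worker
// after a fixed number of calls (50 in the answer above).
class RecyclingWorker {
  constructor(createWorker, maxCalls = 50) {
    // Assumption: createWorker() returns an object with render(url) and kill().
    this.createWorker = createWorker;
    this.maxCalls = maxCalls;
    this.calls = 0;
    this.worker = createWorker();
  }

  render(url) {
    if (this.calls >= this.maxCalls) {
      // Discard the instance that may have leaked memory and start a fresh one.
      this.worker.kill();
      this.worker = this.createWorker();
      this.calls = 0;
    }
    this.calls += 1;
    return this.worker.render(url);
  }
}
```

In a real pool, `createWorker` would presumably spawn a PhantomJS process and `kill()` would terminate it; both are left abstract here.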
The second biggest problem you'll run into is that per-page processing is very CPU and memory intensive, so you'll only be able to run 4 or so instances per CPU.
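As a rough illustration of that sizing rule, a pool size could be derived from the CPU count. The helper name `phantomPoolSize` is hypothetical, and the factor of 4 is just the answer's estimate, so treat it as a starting point to tune.

```javascript
const os = require('os');

// Hypothetical helper: derive a pool size from the
// "4 or so instances per CPU" observation above.
function phantomPoolSize(instancesPerCpu = 4, cpuCount = os.cpus().length) {
  // Always allow at least one instance, even if the CPU count is reported as 0.
  return Math.max(1, Math.floor(instancesPerCpu * cpuCount));
}
```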
The third biggest problem you'll run into is that PhantomJs is pretty wacky with page-finish events and redirects. You'll be informed that your page is finished rendering before it actually is. There are a number of ways to deal with this, but nothing 'standard' unfortunately.
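One heuristic sometimes used for this (an assumption for illustration, not necessarily what the answer's author did) is to treat a page as finished only once no requests are in flight and nothing has happened for a short quiet period. The sketch below is plain JavaScript with hypothetical names (`NetworkIdleTracker`, `quietMs`); in PhantomJS its methods would presumably be driven from the `page.onResourceRequested` and `page.onResourceReceived` callbacks.

```javascript
// Hypothetical sketch of a "network quiet" heuristic. Timestamps are passed in
// explicitly (milliseconds) so the logic stays independent of any timer API.
class NetworkIdleTracker {
  constructor(quietMs = 500) {
    this.quietMs = quietMs;   // how long the network must stay silent
    this.pending = 0;         // requests started but not yet finished
    this.lastActivity = 0;    // time of the most recent start/finish event
  }

  // Call when a request starts.
  requestStarted(now) {
    this.pending += 1;
    this.lastActivity = now;
  }

  // Call when a request completes (or fails).
  requestFinished(now) {
    this.pending = Math.max(0, this.pending - 1);
    this.lastActivity = now;
  }

  // True when nothing is in flight and the quiet period has elapsed.
  isIdle(now) {
    return this.pending === 0 && now - this.lastActivity >= this.quietMs;
  }
}
```

A caller would poll `isIdle(Date.now())` after the load event and only then read the page content. The quiet period is a trade-off: longer values catch more late AJAX calls but slow every request down.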
The fourth biggest problem you'll have to deal with is interop between nodejs and phantomjs. Thankfully, there are a lot of npm packages to choose from that deal with this issue.
So I know I'm biased (as I wrote the solution I'm going to suggest), but I suggest you check out PhantomJsCloud.com, which is free for light usage.
Jan 2015 update: Another (5th?) big problem I ran into is how to send the request/response from the manager/load-balancer. Originally I was using PhantomJS's built-in HTTP server, but kept running into its limitations, especially regarding maximum response size. I ended up writing the request/response to the local file system as the lines of communication.
* Total time spent on implementation of the service, issues included, is perhaps 20 man-weeks, perhaps 1000 hours of work.
* And FYI, I am doing a complete rewrite for the next version (in progress).
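A file-based exchange like the one described might look roughly like this on the Node side. The file naming scheme and the helper names (`writeMessage`, `readMessage`) are assumptions made up for this sketch; the answer does not say how its files are laid out.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical layout: <dir>/<id>.<kind>.json, where kind is 'request' or 'response'.
function writeMessage(dir, id, kind, payload) {
  const finalPath = path.join(dir, id + '.' + kind + '.json');
  const tmpPath = finalPath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(payload));
  // Rename after writing so a reader never sees a half-written file.
  fs.renameSync(tmpPath, finalPath);
  return finalPath;
}

// Returns the parsed message, or null if it has not been written yet.
function readMessage(dir, id, kind) {
  const filePath = path.join(dir, id + '.' + kind + '.json');
  if (!fs.existsSync(filePath)) {
    return null;
  }
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}
```

The manager would write a request file and poll for the matching response; the PhantomJS side would do the reverse using its own `fs` module. Unlike an HTTP response, a file has no practical size cap, which is the limitation the answer mentions.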
Answered by Michelle Tilley
The async JavaScript library works in Node and has a queue function that is quite handy for this kind of thing:
queue(worker, concurrency): Creates a queue object with the specified concurrency. Tasks added to the queue will be processed in parallel (up to the concurrency limit). If all workers are in progress, the task is queued until one is available. Once a worker has completed a task, the task's callback is called.
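To make those semantics concrete, here is a minimal hand-rolled queue with the same basic behaviour. It is a simplified sketch for illustration, not the async library's implementation, and `createQueue` is a hypothetical name.

```javascript
// Minimal concurrency-limited queue: at most `concurrency` tasks run at once,
// the rest wait in FIFO order until a worker finishes.
function createQueue(worker, concurrency) {
  const waiting = [];
  let running = 0;

  function next() {
    while (running < concurrency && waiting.length > 0) {
      const item = waiting.shift();
      running += 1;
      worker(item.task, function (err, result) {
        running -= 1;
        if (item.callback) {
          item.callback(err, result); // the task's own callback
        }
        next(); // a slot is free, so start the next waiting task
      });
    }
  }

  return {
    push: function (task, callback) {
      waiting.push({ task: task, callback: callback });
      next();
    },
    running: function () { return running; },
    length: function () { return waiting.length; }
  };
}
```

With PhantomJS, the concurrency value plays the role of the pool size: each running task occupies one instance.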
Some pseudocode:
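    var async = require('async');
    var express = require('express'); // assumption: `app` below is an Express app
    var app = express();

    function getSourceViaPhantomJs(url, callback) {
        // placeholder: someMagicPhantomJsStuff stands in for the real PhantomJS rendering
        var resultingHtml = someMagicPhantomJsStuff(url);
        callback(null, resultingHtml);
    }

    var q = async.queue(function (task, callback) {
        // delegate to a function that should call callback when it's done
        // with (err, resultingHtml) as parameters
        getSourceViaPhantomJs(task.url, callback);
    }, 5); // up to 5 PhantomJS calls at a time

    app.get('/some/url', function (req, res) {
        // the URL to scrape is read from the query string
        q.push({url: req.query.url_to_scrape}, function (err, results) {
            if (err) {
                res.statusCode = 500;
                return res.end(String(err));
            }
            res.end(results);
        });
    });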
See the full documentation for queue in the project's readme.
Answered by Thomas Dondorf
For my master's thesis, I developed the library phantomjs-pool, which does exactly this. It lets you provide jobs, which are then mapped to PhantomJS workers. The library handles job distribution, communication, error handling, logging, restarting and more. The library was successfully used to crawl more than one million pages.
Example:
The following code executes a Google search for the numbers 0 to 9 and saves a screenshot of the page as googleX.png. Four websites are crawled in parallel (due to the creation of four workers). The script is started via node master.js.
master.js (runs in the Node.js environment)
    var Pool = require('phantomjs-pool').Pool;

    var pool = new Pool({ // create a pool
        numWorkers : 4,   // with 4 workers
        jobCallback : jobCallback,
        workerFile : __dirname + '/worker.js', // location of the worker file
        // either provide the location of the binary or install phantomjs or phantomjs2 (via npm)
        phantomjsBinary : __dirname + '/path/to/phantomjs_binary'
    });
    pool.start();

    function jobCallback(job, worker, index) { // called to create a single job
        if (index < 10) { // index is counted up for each job automatically
            job(index, function(err) { // create the job with index as data
                console.log('DONE: ' + index); // log that the job was done
            });
        } else {
            job(null); // no more jobs
        }
    }
worker.js (runs in the PhantomJS environment)
    var webpage = require('webpage');

    module.exports = function(data, done, worker) { // data provided by the master
        var page = webpage.create();

        // search for the given data (which contains the index number) and save a screenshot
        page.open('https://www.google.com/search?q=' + data, function() {
            page.render('google' + data + '.png');
            done(); // signal that the job was executed
        });
    };
Answered by TTT
As an alternative to @JasonS's great answer, you can try PhearJS, which I built. PhearJS is a supervisor for PhantomJS instances, written in NodeJS, that provides an API via HTTP. It is available as open source on GitHub.
Answered by Shawn Liu
If you are using nodejs, why not use selenium-webdriver?
- Run some phantomjs instances as webdrivers:

    phantomjs --webdriver=port_number

- For each phantomjs instance, create a PhantomInstance:

    var webdriver = require('selenium-webdriver');

    function PhantomInstance(port) {
        this.port = port;
    }

    PhantomInstance.prototype.getDriver = function () {
        var self = this;
        var driver = new webdriver.Builder()
            .forBrowser('phantomjs')
            .usingServer('http://localhost:' + self.port)
            .build();
        return driver;
    };

  and put all of them into one array: [phantomInstance1, phantomInstance2]

- Create dispatcher.js, which gets a free phantomInstance from the array and obtains its driver:

    var driver = phantomInstance.getDriver();
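The answer does not show the dispatcher itself. One possible shape for it, sketched here with hypothetical names (`Dispatcher`, `acquire`, `release`), keeps a list of free instances and makes callers wait when none is available:

```javascript
// Hypothetical dispatcher: hands out free instances and queues callers
// when every instance is busy.
class Dispatcher {
  constructor(instances) {
    this.free = instances.slice(); // instances nobody is using right now
    this.waiters = [];             // callbacks waiting for an instance
  }

  // Calls callback(instance) immediately if one is free, otherwise later.
  acquire(callback) {
    if (this.free.length > 0) {
      callback(this.free.shift());
    } else {
      this.waiters.push(callback);
    }
  }

  // Return an instance; hand it straight to the longest-waiting caller, if any.
  release(instance) {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(instance);
    } else {
      this.free.push(instance);
    }
  }
}
```

A request handler would call `acquire`, use `instance.getDriver()` to drive the page, and call `release(instance)` when done, including on errors, so that instances are not lost from the pool.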
Answered by thisisnotadisplayname
If you are using nodejs, you can use https://github.com/sgentle/phantomjs-node, which allows you to connect an arbitrary number of phantomjs processes to your main NodeJS process, and hence gives you the ability to use async.js and lots of other node goodies.

