Original question: http://stackoverflow.com/questions/26517852/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Taking reliable screenshots of websites? Phantomjs and Casperjs both return empty screen shots on some websites
Asked by fabbb
Open a web page and take a screenshot.
Using ONLY PhantomJS (this is a simple script; in fact, it is the example script from their docs at http://phantomjs.org/screen-capture.html):
var page = require('webpage').create();
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});
The problem is that some websites (like GitHub), funnily enough, seem to detect PhantomJS and refuse to serve it, so nothing is rendered. The result is that github.png is a blank white PNG file.
Replace github with, say, "google.com" and you get a nice (proper) screenshot, as intended.
At first I thought this was a PhantomJS issue, so I tried running it through CasperJS with:
var casper = require('casper').create();
casper.start('http://www.github.com/', function() {
  this.captureSelector('github.png', 'body');
});
casper.run();
But I get the same behavior as with PhantomJS.
So I figured, OK, this is most likely a user agent issue, as in: GitHub sniffs out PhantomJS and decides not to show the page. So I set the user agent as shown below, but that still didn't work.
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});
So then I tried to parse the page, and apparently some sites (again, like GitHub) don't appear to be sending anything down the wire.
Using CasperJS I tried to print the title. For google.com I got back Google, but for github.com I got back bupkis. Example code:
var casper = require('casper').create();
casper.start('http://github.com/', function() {
  this.echo(this.getTitle());
});
casper.run();
The same approach also produces the same result in pure PhantomJS.
Update:
Could this be a timing issue? Is GitHub just super slow? I doubt it, but let's test anyway.
var page = require('webpage').create();
page.open('http://github.com', function (status) {
  /* irrelevant */
  window.setTimeout(function () {
    page.render('github.png');
    phantom.exit();
  }, 3000);
});
And the result is still bupkis. So no, it's not a timing issue.
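A fixed delay cannot tell a slow page apart from one that never loaded. The sketch below, which is an illustration and not part of the original question, checks the `status` argument that PhantomJS passes to the `page.open` callback and logs resource errors through `page.onResourceError`. The helper names `loadSucceeded` and `describeResourceError` are hypothetical, and the `typeof phantom` guard is only there so the helpers can be exercised outside PhantomJS.

```javascript
// Sketch: render only when the load actually succeeded.
// PhantomJS passes 'success' or 'fail' to the page.open callback.
function loadSucceeded(status) {
  return status === 'success';
}

// Hypothetical helper: turn a resource error object into a one-line message.
function describeResourceError(err) {
  return 'Resource error ' + err.errorCode + ' (' + err.errorString + ') for ' + err.url;
}

// The guard keeps the PhantomJS-only part from running in other JS runtimes.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  // Log every resource that fails to load, with the reason.
  page.onResourceError = function (err) {
    console.log(describeResourceError(err));
  };

  page.open('http://github.com/', function (status) {
    if (!loadSucceeded(status)) {
      console.log('Page failed to load, not rendering.');
      phantom.exit(1);
      return;
    }
    page.render('github.png');
    phantom.exit();
  });
}
```

With a check like this, a failed load should show up as a message on the console rather than as a blank PNG.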
- How are some sites like GitHub blocking PhantomJS?
- How can we reliably take screenshots of ALL webpages? It needs to be fast and headless.
Answered by fabbb
After bouncing this around for some time I was able to narrow down the problem. Apparently PhantomJS defaults to SSLv3, which causes GitHub to refuse the connection due to a failed SSL handshake.
phantomjs --debug=true github.js
This shows the following output:
. . .
2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10
2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/"
2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100
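When the debug output is long, the network errors can be pulled out mechanically. The following is a minimal sketch that assumes the log lines look exactly like the "Resource request error" line shown above; the function name `parseResourceErrors` is hypothetical, and the pattern has not been checked against other PhantomJS versions.

```javascript
// Sketch: extract "Resource request error" entries from PhantomJS debug output.
// The pattern is based only on the single log line format shown in this answer.
function parseResourceErrors(logText) {
  var pattern = /Resource request error: (\d+) \( "([^"]*)" \) URL: "([^"]*)"/;
  var errors = [];
  logText.split('\n').forEach(function (line) {
    var match = pattern.exec(line);
    if (match) {
      errors.push({
        code: parseInt(match[1], 10),
        message: match[2],
        url: match[3]
      });
    }
  });
  return errors;
}

// The three log lines quoted above, used as sample input.
var sampleLog = [
  '2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10',
  '2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/"',
  '2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100'
].join('\n');

console.log(JSON.stringify(parseResourceErrors(sampleLog)));
// prints [{"code":6,"message":"SSL handshake failed","url":"https://github.com/"}]
```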
So from this we can conclude that no screenshot was taken because GitHub was refusing the connection. Great, that makes perfect sense. So let's set the SSL flag to --ssl-protocol=any and also ignore SSL errors with --ignore-ssl-errors=true:
phantomjs --ignore-ssl-errors=true --ssl-protocol=any --debug=true github.js
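If the command is launched from another script rather than typed by hand, the argument list could be assembled along these lines. This is only a sketch: the function name `buildPhantomArgs` and its option names are hypothetical, and only the three flags used in this answer are handled.

```javascript
// Sketch: assemble the PhantomJS command-line arguments used in this answer.
// Option names (sslProtocol, ignoreSslErrors, debug) are made up for illustration.
function buildPhantomArgs(scriptPath, options) {
  var args = [];
  if (options.sslProtocol) {
    args.push('--ssl-protocol=' + options.sslProtocol);
  }
  if (options.ignoreSslErrors) {
    args.push('--ignore-ssl-errors=true');
  }
  if (options.debug) {
    args.push('--debug=true');
  }
  // PhantomJS options go before the script name.
  args.push(scriptPath);
  return args;
}

var args = buildPhantomArgs('github.js', {
  sslProtocol: 'any',
  ignoreSslErrors: true,
  debug: true
});
console.log('phantomjs ' + args.join(' '));
// prints phantomjs --ssl-protocol=any --ignore-ssl-errors=true --debug=true github.js
```

From Node, an array like this could be handed to `child_process.execFile('phantomjs', args, callback)`; that driver is an assumption and not something the original answer uses.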
Great success! A screenshot is now being rendered and saved properly, but the debug output is showing us a TypeError:
TypeError: 'undefined' is not a function (evaluating 'Array.prototype.forEach.call.bind(Array.prototype.forEach)')
https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 72
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 88
ReferenceError: Can't find variable: $
https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
I checked the GitHub homepage manually just to see whether the TypeError occurs there, and it does NOT.
My next guess is that the assets aren't loading quickly enough. PhantomJS is faster than a speeding bullet!
So let's try to slow it down artificially and see if we can get rid of that TypeError...
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
  window.setTimeout(function () {
    page.render('github.png');
    phantom.exit();
  }, 3000);
});
That didn't work... After a closer inspection of the image, it is clear that some elements are missing, mainly some icons and the logo.
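A fixed delay is a blunt instrument. One alternative, sketched here under the assumption that outstanding network requests are what matters, is to count requests through `page.onResourceRequested`, `page.onResourceReceived` and `page.onResourceError`, and render once none are pending or a maximum wait has passed. The `createRequestTracker` helper and the 250 ms / 10 s values are hypothetical, and nothing here is claimed to fix the missing icons.

```javascript
// Sketch: track outstanding requests instead of sleeping for a fixed time.
function createRequestTracker() {
  var pending = {};
  return {
    requested: function (id) { pending[id] = true; },
    finished: function (id) { delete pending[id]; },
    pendingCount: function () { return Object.keys(pending).length; }
  };
}

// The guard keeps the PhantomJS-only part from running in other JS runtimes.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var tracker = createRequestTracker();

  page.onResourceRequested = function (requestData) {
    tracker.requested(requestData.id);
  };
  page.onResourceReceived = function (response) {
    // Each response is reported in chunks; 'end' marks the last one.
    if (response.stage === 'end') {
      tracker.finished(response.id);
    }
  };
  page.onResourceError = function (err) {
    tracker.finished(err.id);
  };

  page.open('https://github.com/', function (status) {
    var waited = 0;
    var timer = window.setInterval(function () {
      waited += 250;
      // Render when the network is idle, or give up waiting after 10 seconds.
      if (tracker.pendingCount() === 0 || waited >= 10000) {
        window.clearInterval(timer);
        page.render('github.png');
        phantom.exit();
      }
    }, 250);
  });
}
```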
Success? Partially, because we are now at least getting a screenshot where earlier we weren't getting a thing.
Job done? Not exactly. We need to determine what is causing that TypeError, because it is preventing some assets from loading and distorting the image.
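To see page-side JavaScript errors without wading through the full debug output, PhantomJS offers a `page.onError` callback that receives the message and a stack trace. A minimal sketch follows; the `formatPageError` helper is a hypothetical name, and the `typeof phantom` guard only exists so the helper can be tried outside PhantomJS.

```javascript
// Sketch: print errors thrown by the page's own scripts, with their origin.
// Each trace frame is expected to carry `file`, `line` and `function` fields.
function formatPageError(msg, trace) {
  var lines = ['PAGE ERROR: ' + msg];
  (trace || []).forEach(function (frame) {
    var where = frame.file + ':' + frame.line;
    if (frame['function']) {
      where += ' (in function "' + frame['function'] + '")';
    }
    lines.push('  at ' + where);
  });
  return lines.join('\n');
}

// The guard keeps the PhantomJS-only part from running in other JS runtimes.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  page.onError = function (msg, trace) {
    console.log(formatPageError(msg, trace));
  };

  page.open('https://github.com/', function (status) {
    window.setTimeout(function () {
      page.render('github.png');
      phantom.exit();
    }, 3000);
  });
}
```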
Additional
I attempted to recreate this with CasperJS. Its --debug output is very ugly and hard to follow compared to PhantomJS:
casper.start();
casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
casper.thenOpen('https://www.github.com/', function() {
  this.captureSelector('github.png', 'body');
});
casper.run();
Console:
casperjs test --ssl-protocol=any --debug=true github.js
Further, the image is missing the same icons but is also visually distorted. Since CasperJS relies on PhantomJS, I do not see the value in using it for this specific task.
If you would like to add to my answer, please share your findings. I am very interested in a flawless PhantomJS solution.
Update #1: Removing the TypeError
@ArtjomB points out that PhantomJS does not support JavaScript's bind in its current version as of this update (1.9.7). He explains the reason in his answer (ArtjomB: PhantomJs Bind Issue Answer):
The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support it. PhantomJS 1.x uses an old fork of QtWebkit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 will use a newer engine which will support bind. For now you need to add a shim inside of the page.onInitialized event handler:
OK great, so the following code will take care of our TypeError from above. (But it is not a full fix; see below for details.)
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
  window.setTimeout(function () {
    page.render('github.png');
    phantom.exit();
  }, 5000);
});
page.onInitialized = function () {
  page.evaluate(function () {
    var isFunction = function (o) {
      return typeof o == 'function';
    };
    var bind,
      slice = [].slice,
      proto = Function.prototype,
      featureMap;
    featureMap = {
      'function-bind': 'bind'
    };
    function has(feature) {
      var prop = featureMap[feature];
      return isFunction(proto[prop]);
    }
    // check for missing features
    if (!has('function-bind')) {
      // adapted from Mozilla Developer Network example at
      // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
      bind = function bind(obj) {
        var args = slice.call(arguments, 1),
          self = this,
          nop = function () {
          },
          bound = function () {
            return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
          };
        nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
        bound.prototype = new nop();
        return bound;
      };
      proto.bind = bind;
    }
  });
};
Now the above code gets us the same screenshot as before, AND the debug output no longer shows a TypeError, so on the surface everything appears to work. Progress has been made.
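To get a feel for what the shim does, the same logic can be written as a standalone function and tried outside the page. This is a sketch for illustration only: `bindShim` is a hypothetical name, it takes the target function as its first argument instead of patching `Function.prototype`, and the last part reproduces the expression from the TypeError (`forEach.call` bound to `forEach`).

```javascript
// Sketch: the shim's logic as a plain function, so it can be tested in isolation.
function bindShim(fn, obj) {
  var slice = Array.prototype.slice;
  var args = slice.call(arguments, 2); // arguments fixed at bind time
  var nop = function () {};
  var bound = function () {
    // When called with `new`, keep the new instance; otherwise use `obj`.
    return fn.apply(this instanceof nop ? this : (obj || {}),
      args.concat(slice.call(arguments)));
  };
  nop.prototype = fn.prototype || {};
  bound.prototype = new nop();
  return bound;
}

// Fixed `this` and one fixed leading argument.
function greet(greeting, name) {
  return greeting + ', ' + name + ' from ' + this.site;
}
var greetFromGithub = bindShim(greet, { site: 'github' }, 'Hello');
console.log(greetFromGithub('fabbb')); // prints Hello, fabbb from github

// The expression from the TypeError, rebuilt with the shim:
// a forEach that takes the array-like object as its first argument.
var each = bindShim(Array.prototype.forEach.call, Array.prototype.forEach);
var seen = [];
each({ 0: 'a', 1: 'b', length: 2 }, function (item) {
  seen.push(item);
});
console.log(seen.join(',')); // prints a,b
```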
Unfortunately, all of the image icons (logo, etc.) are still not loading correctly. We see some sort of "3W" icon; not sure where that's from.
Thanks for the help, @ArtjomB.