Source: Stack Overflow, http://stackoverflow.com/questions/26517852/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors.

Taking reliable screenshots of websites? Phantomjs and Casperjs both return empty screen shots on some websites

Tags: javascript, phantomjs, screen-scraping, casperjs

Asked by fabbb

Open a web page and take a screenshot.

Using ONLY PhantomJS (this is a simple script; in fact it is the example script used in their docs: http://phantomjs.org/screen-capture.html):

var page = require('webpage').create();
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

The problem is that some websites (like GitHub), funnily enough, seem to detect PhantomJS and refuse to serve it, so nothing is rendered. The result is that github.png is a blank white PNG file.

Replace github with, say, "google.com" and you get a nice (proper) screenshot, as intended.

At first I thought this was a Phantomjs issue so I tried running it through Casperjs with:

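// Create the casper instance (needed when run as: casperjs script.js)
var casper = require('casper').create();

casper.start('http://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();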

But I get the same behavior as with PhantomJS.

So I figured, OK, this is most likely a user agent issue, as in: GitHub sniffs out PhantomJS and decides not to show the page. So I set the user agent as shown below, but that still didn't work.

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

So then I tried to parse the page and apparently some sites (again like github) don't appear to be sending anything down the wire.

Using CasperJS I tried to print the title. For google.com I got back Google, but for github.com I got back bupkis. Example code:

var casper = require('casper').create();

casper.start('http://github.com/', function() {
    this.echo(this.getTitle());
});

casper.run();  

The equivalent script in pure PhantomJS produces the same result.

Update:

Could this be a timing issue? Is GitHub just super slow? I doubt it, but let's test anyway.

var page = require('webpage').create();
page.open('http://github.com', function (status) {
    /* irrelevant */
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 3000);
});

And the result is still bupkis. So no it's not a timing issue.
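
A fixed delay only guesses at when a page has finished loading. One way to make a timing test less arbitrary is to track in-flight requests and render once none are pending. The following is a minimal sketch, not a fix for the blank screenshot: createRequestTracker, the 250 ms polling interval, and the 10 second cap are illustrative assumptions, while page.onResourceRequested, page.onResourceReceived, and page.onResourceError are standard PhantomJS callbacks.

```javascript
// Hypothetical helper (not part of PhantomJS): tracks in-flight requests by id.
function createRequestTracker() {
  var pending = {};
  return {
    started: function (id) { pending[id] = true; },
    finished: function (id) { delete pending[id]; },
    count: function () { return Object.keys(pending).length; }
  };
}

// Only run the PhantomJS-specific part when executed by phantomjs itself.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var tracker = createRequestTracker();
  var waited = 0;

  page.onResourceRequested = function (requestData) {
    tracker.started(requestData.id);
  };
  page.onResourceReceived = function (response) {
    // Fired per chunk; only the 'end' stage means the resource is done.
    if (response.stage === 'end') {
      tracker.finished(response.id);
    }
  };
  page.onResourceError = function (resourceError) {
    tracker.finished(resourceError.id);
  };

  page.open('http://github.com/', function (status) {
    // Poll until nothing is pending, or give up after 10 seconds (assumed cap).
    var timer = window.setInterval(function () {
      waited += 250;
      if (tracker.count() === 0 || waited >= 10000) {
        window.clearInterval(timer);
        page.render('github.png');
        phantom.exit();
      }
    }, 250);
  });
}
```

Requests that page scripts start after a quiet moment would be missed by this simple check, so it is a starting point rather than a guarantee.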

  1. How are some sites like GitHub blocking PhantomJS?
  2. How can we reliably take screenshots of ALL webpages? The solution needs to be fast and headless.

Answered by fabbb

After bouncing this around for some time I was able to narrow down the problem. Apparently PhantomJS uses a default SSL protocol of sslv3, which causes GitHub to refuse the connection due to a bad SSL handshake.

phantomjs --debug=true github.js

Shows output of:

. . .
2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10 
2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/" 
2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100 
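
The same failure can be surfaced from inside the script instead of reading the full debug log. A minimal sketch, assuming PhantomJS 1.9 or later (where page.onResourceError is available); formatResourceError is a hypothetical helper name, not a PhantomJS function:

```javascript
// Hypothetical helper: turns a PhantomJS resourceError object into one log line.
function formatResourceError(resourceError) {
  return 'Resource error ' + resourceError.errorCode +
    ' (' + resourceError.errorString + ') for ' + resourceError.url;
}

// Only run the PhantomJS-specific part when executed by phantomjs itself.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  // Called for every resource that fails to load, including the main document.
  page.onResourceError = function (resourceError) {
    console.log(formatResourceError(resourceError));
  };

  page.open('http://github.com/', function (status) {
    // status is 'success' or 'fail'; skip rendering a page that never loaded.
    if (status !== 'success') {
      console.log('Load failed, not rendering.');
      phantom.exit(1);
      return;
    }
    page.render('github.png');
    phantom.exit();
  });
}
```

With the handshake failure shown in the log above, one would expect this to print a line mentioning the SSL handshake error rather than silently writing a blank image.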

So from this we can conclude that no screenshot was taken because GitHub was refusing the connection. Great, that makes perfect sense. So let's set the SSL flag to --ssl-protocol=any, and let's also ignore SSL errors with --ignore-ssl-errors=true:

phantomjs --ignore-ssl-errors=true --ssl-protocol=any --debug=true github.js
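
If the capture is driven from another program, the flags have to come before the script name, because anything after the script path is passed to the script as its own arguments. A minimal sketch of a Node.js wrapper, assuming a phantomjs binary is on the PATH; buildPhantomArgs, runCapture, and the 30 second timeout are hypothetical choices for illustration:

```javascript
var execFile = require('child_process').execFile;

// Builds the PhantomJS argument list: options first, script path last.
function buildPhantomArgs(scriptPath) {
  return [
    '--ignore-ssl-errors=true',
    '--ssl-protocol=any',
    scriptPath
  ];
}

// Runs the capture script; assumes a `phantomjs` binary is on the PATH.
function runCapture(scriptPath, callback) {
  execFile(
    'phantomjs',
    buildPhantomArgs(scriptPath),
    { timeout: 30000 }, // assumed upper bound for a single capture
    function (err, stdout, stderr) {
      callback(err, stdout, stderr);
    }
  );
}

// Example usage (not executed here):
// runCapture('github.js', function (err, stdout) {
//   if (err) { console.error('capture failed:', err.message); return; }
//   console.log(stdout);
// });
```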

Great success! A screenshot is now being rendered and saved properly, but the debug output is showing us a TypeError:

TypeError: 'undefined' is not a function (evaluating 'Array.prototype.forEach.call.bind(Array.prototype.forEach)')

  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 72 
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 88 
ReferenceError: Can't find variable: $

  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1

I checked the github homepage manually just to see if a TypeError existed and it does NOT.

My next guess is that the assets aren't loading quickly enough. PhantomJS is faster than a speeding bullet!

So let's try to slow it down artificially and see if we can get rid of that TypeError...

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 3000);
});

That didn't work... After a closer inspection of the image, it is clear that some elements are missing, mainly some icons and the logo.

Success? Partially, because we are now at least getting a screenshot where earlier we weren't getting a thing.

Job done? Not exactly. I need to determine what is causing that TypeError, because it is preventing some assets from loading and distorting the image.

Additional

I attempted to recreate this with CasperJS. Its --debug output is very ugly and hard to follow compared to PhantomJS:

casper.start();
casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
casper.thenOpen('https://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();

console:

casperjs test --ssl-protocol=any --debug=true github.js

Furthermore, the image is missing the same icons and is also visually distorted. Since CasperJS relies on PhantomJS, I do not see the value in using it for this specific task.

If you would like to add to my answer, please share your findings. I am very interested in a flawless PhantomJS solution.

Update #1 : Removing the TypeError

@ArtjomB points out that PhantomJS does not support JavaScript's bind in its current version as of this update (1.9.7). He explains this in his answer (ArtjomB: PhantomJs Bind Issue Answer):

The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support it. PhantomJS 1.x uses an old fork of QtWebkit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 will use a newer engine which will support bind. For now you need to add a shim inside of the page.onInitialized event handler:

OK, great, so the following code will take care of our TypeError from above. (It is not a full solution, though; see below for details.)

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 5000);
});
page.onInitialized = function(){
    page.evaluate(function(){
        var isFunction = function(o) {
          return typeof o == 'function';
        };

        var bind,
          slice = [].slice,
          proto = Function.prototype,
          featureMap;

        featureMap = {
          'function-bind': 'bind'
        };

        function has(feature) {
          var prop = featureMap[feature];
          return isFunction(proto[prop]);
        }

        // check for missing features
        if (!has('function-bind')) {
          // adapted from Mozilla Developer Network example at
          // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
          bind = function bind(obj) {
            var args = slice.call(arguments, 1),
              self = this,
              nop = function() {
              },
              bound = function() {
                return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
              };
            nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
            bound.prototype = new nop();
            return bound;
          };
          proto.bind = bind;
        }
    });
}

Now the above code will get us the same screenshot as before, AND the debug output will not show a TypeError, so on the surface everything appears to work. Progress has been made.
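
To check the shim's logic independently of PhantomJS, it can be run as a plain function in any JavaScript engine. A minimal sketch: shimBind is a hypothetical standalone copy of the bind function from the shim, applied to the same expression that appears in the TypeError message.

```javascript
// Same logic as the shim's bind, written as a standalone function
// (hypothetical name) so it can be invoked with .call() on any function.
function shimBind(obj) {
  var slice = Array.prototype.slice,
    args = slice.call(arguments, 1),
    self = this,
    nop = function () {},
    bound = function () {
      return self.apply(
        this instanceof nop ? this : (obj || {}),
        args.concat(slice.call(arguments))
      );
    };
  nop.prototype = this.prototype || {};
  bound.prototype = new nop();
  return bound;
}

// The expression from the TypeError, using the shim instead of native bind.
var forEach = shimBind.call(Array.prototype.forEach.call, Array.prototype.forEach);

// The result iterates over array-like objects, which is how such helpers are typically used.
var seen = [];
forEach({ 0: 'a', 1: 'b', length: 2 }, function (item) {
  seen.push(item);
});
console.log(seen.join(',')); // a,b

// Partial application also works: the first argument is pre-filled.
function add(a, b) {
  return a + b;
}
var addTwo = shimBind.call(add, null, 2);
console.log(addTwo(3)); // 5
```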

Unfortunately, all of the image icons (logo, etc.) are still not loading correctly. We see some sort of 3W icon; I'm not sure where that's from.

Thanks for the help @ArtjomB
