JavaScript: PhantomJS not waiting for "full" page load

Question

Asked by nilfalse

I'm using PhantomJS v1.4.1 to load some web pages. I don't have access to their server side; I'm only given links pointing to them. I'm using an obsolete version of PhantomJS because I need to support Adobe Flash on those pages.

The problem is that many websites load their minor content asynchronously, so PhantomJS's onLoadFinished callback (the analogue of onLoad in HTML) fires too early, before everything has loaded. Can anyone suggest how I can wait for a web page to load fully so that I can, for example, take a screenshot that includes all dynamic content such as ads?

Answer 1

Answered by rhunwicks

Another approach is to just ask PhantomJS to wait for a bit after the page has loaded before doing the render, as in the regular rasterize.js example, but with a longer timeout to allow the JavaScript to finish loading additional resources:

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            page.render(output);
            phantom.exit();
        }, 1000); // Change timeout as required to allow sufficient time 
    }
});

Answer 2

Answered by Mateusz Charytoniuk

I would rather periodically check the document.readyState status (https://developer.mozilla.org/en-US/docs/Web/API/document.readyState). Although this approach is a bit clunky, you can be sure that inside the onPageReady function you are working with a fully loaded document.

var page = require("webpage").create(),
    url = "http://example.com/index.html";

function onPageReady() {
    var htmlContent = page.evaluate(function () {
        return document.documentElement.outerHTML;
    });

    console.log(htmlContent);

    phantom.exit();
}

page.open(url, function (status) {
    function checkReadyState() {
        setTimeout(function () {
            var readyState = page.evaluate(function () {
                return document.readyState;
            });

            if ("complete" === readyState) {
                onPageReady();
            } else {
                checkReadyState();
            }
        });
    }

    checkReadyState();
});

Additional explanation:

Using nested setTimeout calls instead of setInterval prevents checkReadyState invocations from overlapping and causing race conditions when one execution takes longer than expected. setTimeout has a default delay of about 4 ms (https://stackoverflow.com/a/3580085/1011156), so active polling will not drastically affect program performance.

document.readyState === "complete" means that the document and all of its resources have finished loading (https://html.spec.whatwg.org/multipage/dom.html#current-document-readiness).
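The nested-timeout idea can be pulled out into a small reusable function. The following is only a sketch: pollUntil and its schedule parameter are hypothetical names, not part of PhantomJS, and schedule is assumed to behave like setTimeout.

```javascript
// Hypothetical helper (not a PhantomJS API): polls `check` using nested timeouts.
// `schedule` defaults to setTimeout; it is a parameter only so that the loop
// can be driven by a fake scheduler in tests.
function pollUntil(check, onReady, delayMs, schedule) {
    schedule = schedule || setTimeout;

    function tick() {
        if (check()) {
            onReady();
        } else {
            // The next check is queued only after this one has finished,
            // so two checks can never overlap.
            schedule(tick, delayMs);
        }
    }

    schedule(tick, delayMs);
}

// Possible use inside the page.open callback (PhantomJS):
// pollUntil(function () {
//     return page.evaluate(function () { return document.readyState; }) === "complete";
// }, onPageReady, 50);
```

Because each check schedules the next one itself, a slow check simply delays the following one instead of stacking up behind it, which is the property the explanation above relies on.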

Answer 3

Answered by rhunwicks

You could try a combination of the waitfor and rasterize examples:

/**
 * See https://github.com/ariya/phantomjs/blob/master/examples/waitfor.js
 * 
 * Wait until the test condition is true or a timeout occurs. Useful for waiting
 * on a server response or for a ui change (fadeIn, etc.) to occur.
 *
 * @param testFx javascript condition that evaluates to a boolean,
 * it can be passed in as a string (e.g.: "1 == 1" or "$('#bar').is(':visible')" or
 * as a callback function.
 * @param onReady what to do when testFx condition is fulfilled,
 * it can be passed in as a string (e.g.: "1 == 1" or "$('#bar').is(':visible')" or
 * as a callback function.
 * @param timeOutMillis the max amount of time to wait. If not specified, 3 sec is used.
 */
function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timout is 3s
        start = new Date().getTime(),
        condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()), //< defensive code
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
};

var page = require('webpage').create(), system = require('system'), address, output, size;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: rasterize.js URL filename [paperwidth*paperheight|paperformat] [zoom]');
    console.log('  paper (pdf output) examples: "5in*7.5in", "10cm*20cm", "A4", "Letter"');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    if (system.args.length > 3 && system.args[2].substr(-4) === ".pdf") {
        size = system.args[3].split('*');
        page.paperSize = size.length === 2 ? {
            width : size[0],
            height : size[1],
            margin : '0px'
        } : {
            format : system.args[3],
            orientation : 'portrait',
            margin : {
                left : "5mm",
                top : "8mm",
                right : "5mm",
                bottom : "9mm"
            }
        };
    }
    if (system.args.length > 4) {
        page.zoomFactor = system.args[4];
    }
    var resources = [];
    page.onResourceRequested = function(request) {
        resources[request.id] = request.stage;
    };
    page.onResourceReceived = function(response) {
        resources[response.id] = response.stage;
    };
    page.open(address, function(status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit();
        } else {
            waitFor(function() {
                // Every recorded resource must have reached its final 'end' stage
                for ( var i = 1; i < resources.length; ++i) {
                    if (resources[i] != 'end') {
                        return false;
                    }
                }
                return true;
            }, function() {
               page.render(output);
               phantom.exit();
            }, 10000);
        }
    });
}

Answer 4

Answered by Dave

Here is a solution that waits for all resource requests to complete. Once complete, it will log the page content to the console and generate a screenshot of the rendered page.

Although this solution can serve as a good starting point, I have observed it fail, so it's definitely not a complete solution!

I didn't have much luck using document.readyState.

I was influenced by the waitfor.js example found on the PhantomJS examples page.

var system = require('system');
var webPage = require('webpage');

var page = webPage.create();
var url = system.args[1];

page.viewportSize = {
  width: 1280,
  height: 720
};

var requestsArray = [];

page.onResourceRequested = function(requestData, networkRequest) {
  requestsArray.push(requestData.id);
};

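page.onResourceReceived = function(response) {
  // This callback fires once per stage ('start' and 'end'); only 'end' means
  // the resource is done. An unknown id must not reach splice(-1, 1), which
  // would remove an unrelated entry.
  if (response.stage !== 'end') {
    return;
  }
  var index = requestsArray.indexOf(response.id);
  if (index !== -1) {
    requestsArray.splice(index, 1);
  }
};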

page.open(url, function(status) {

  var interval = setInterval(function () {

    if (requestsArray.length === 0) {

      clearInterval(interval);
      var content = page.content;
      console.log(content);
      page.render('yourLoadedPage.png');
      phantom.exit();
    }
  }, 500);
});
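The bookkeeping in this answer can also be kept in a small object of its own, which makes the "all requests finished" rule easier to reason about. This is a minimal sketch; createRequestTracker and its method names are hypothetical, and the only PhantomJS behavior assumed is that onResourceReceived reports a stage of 'start' or 'end'.

```javascript
// Hypothetical helper (not a PhantomJS API): tracks request ids still in flight.
function createRequestTracker() {
    var pending = {};
    var count = 0;

    return {
        // Call from page.onResourceRequested with requestData.id.
        requested: function (id) {
            if (!pending[id]) {
                pending[id] = true;
                count++;
            }
        },
        // Call from page.onResourceReceived with response.id and response.stage.
        received: function (id, stage) {
            // Only the 'end' stage of a known request completes it.
            if (stage === 'end' && pending[id]) {
                delete pending[id];
                count--;
            }
        },
        isIdle: function () {
            return count === 0;
        }
    };
}

// Possible wiring (PhantomJS):
// var tracker = createRequestTracker();
// page.onResourceRequested = function (req) { tracker.requested(req.id); };
// page.onResourceReceived = function (res) { tracker.received(res.id, res.stage); };
```

A request that never reaches its 'end' stage keeps the tracker busy forever, so in practice a check like isIdle() is best paired with an overall timeout.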

Answer 5

Answered by Supr

Maybe you can use the onResourceRequested and onResourceReceived callbacks to detect asynchronous loading. Here's an example of using those callbacks from the PhantomJS documentation:

var page = require('webpage').create();
page.onResourceRequested = function (request) {
    console.log('Request ' + JSON.stringify(request, undefined, 4));
};
page.onResourceReceived = function (response) {
    console.log('Receive ' + JSON.stringify(response, undefined, 4));
};
page.open(url);

Also, you can look at examples/netsniff.js for a working example.

Answer 6

Answered by deemstone

In my program, I use some logic to judge whether the page has finished loading: I watch its network requests, and if there has been no new request in the past 200 ms, I treat the page as loaded.

Use this after onLoadFinished fires.

function onLoadComplete(page, callback){
    var waiting = [];  // request id
    var interval = 200;  //ms time waiting new request
    var timer = setTimeout( timeout, interval);
    var max_retry = 3;  //
    var counter_retry = 0;

    function timeout(){
        if(waiting.length && counter_retry < max_retry){
            timer = setTimeout( timeout, interval);
            counter_retry++;
            return;
        }else{
            try{
                callback(null, page);
            }catch(e){}
        }
    }

    //for debug, log time cost
    var tlogger = {};

    bindEvent(page, 'request', function(req){
        waiting.push(req.id);
    });

    bindEvent(page, 'receive', function (res) {
        var cT = res.contentType;
        if(!cT){
            console.log('[contentType] ', cT, ' [url] ', res.url);
        }
        if(!cT) return remove(res.id);
        if(cT.indexOf('application') * cT.indexOf('text') != 0) return remove(res.id);

        if (res.stage === 'start') {
            console.log('!!received start: ', res.id);
            //console.log( JSON.stringify(res) );
            tlogger[res.id] = new Date();
        }else if (res.stage === 'end') {
            console.log('!!received end: ', res.id, (new Date() - tlogger[res.id]) );
            //console.log( JSON.stringify(res) );
            remove(res.id);

            clearTimeout(timer);
            timer = setTimeout(timeout, interval);
        }

    });

    bindEvent(page, 'error', function(err){
        remove(err.id);
        if(waiting.length === 0){
            counter_retry = 0;
        }
    });

    function remove(id){
        var i = waiting.indexOf( id );
        if(i < 0){
            return;
        }else{
            waiting.splice(i,1);
        }
    }

    function bindEvent(page, evt, cb){
        switch(evt){
            case 'request':
                page.onResourceRequested = cb;
                break;
            case 'receive':
                page.onResourceReceived = cb;
                break;
            case 'error':
                page.onResourceError = cb;
                break;
            case 'timeout':
                page.onResourceTimeout = cb;
                break;
        }
    }
}
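The core rule of this answer (treat the page as loaded once the network has been quiet for 200 ms) can be isolated from the event wiring. The sketch below uses hypothetical names, createQuietDetector, touch, and isQuiet, and takes the clock as a parameter purely so the rule can be checked without real waiting.

```javascript
// Hypothetical helper (not a PhantomJS API): reports whether there has been
// no network activity for at least `quietMs` milliseconds.
// `now` defaults to Date.now; it is injectable so the rule can be tested.
function createQuietDetector(quietMs, now) {
    now = now || Date.now;
    var lastActivity = now();

    return {
        // Call on every request and every response.
        touch: function () {
            lastActivity = now();
        },
        isQuiet: function () {
            return now() - lastActivity >= quietMs;
        }
    };
}

// Possible wiring (PhantomJS):
// var detector = createQuietDetector(200);
// page.onResourceRequested = function () { detector.touch(); };
// page.onResourceReceived = function () { detector.touch(); };
```

A polling loop can then call isQuiet() every so often and render the page the first time it returns true.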

Answer 7

Answered by Brankodd

I found this approach useful in some cases:

// onConsoleMessage is a callback property, so it is assigned rather than called
page.onConsoleMessage = function(msg) {
  // do something, e.g. page.render
};

Then, if you own the page, put a script like this inside it:

<script>
  window.onload = function(){
    console.log('page loaded');
  }
</script>

Answer 8

Answered by Manu

I found this solution useful in a NodeJS app. I use it only as a last resort, because it relies on a fixed timeout to wait for the full page load.

The second argument is the callback function, which is called once the response is ready.

var phantom = require('phantom');

var fullLoad = function(anUrl, callbackDone) {
    phantom.create(function (ph) {
        ph.createPage(function (page) {
            page.open(anUrl, function (status) {
                if (status !== 'success') {
                    console.error("phantom: error opening " + anUrl, status);
                    ph.exit();
                } else {
                    // timeOut
                    global.setTimeout(function () {
                        page.evaluate(function () {
                            return document.documentElement.innerHTML;
                        }, function (result) {
                            ph.exit(); // EXTREMELY IMPORTANT
                            callbackDone(result); // callback
                        });
                    }, 5000);
                }
            });
        });
    });
}

var callback = function(htmlBody) {
    // do smth with the htmlBody
}

fullLoad('your/url/', callback);

Answer 9

Answered by Dayong

This is an implementation of Supr's answer. It also uses setTimeout instead of setInterval, as Mateusz Charytoniuk suggested.

PhantomJS will exit once there has been no request or response for 1000 ms.

// load the module
var webpage = require('webpage');
// get timestamp
function getTimestamp(){
    // or use Date.now()
    return new Date().getTime();
}

var lastTimestamp = getTimestamp();

var page = webpage.create();
page.onResourceRequested = function(request) {
    // update the timestamp when there is a request
    lastTimestamp = getTimestamp();
};
page.onResourceReceived = function(response) {
    // update the timestamp when there is a response
    lastTimestamp = getTimestamp();
};

page.open(html, function(status) {
    if (status !== 'success') {
        // exit if it fails to load the page
        phantom.exit(1);
    }
    else{
        // do something here
    }
});

function checkReadyState() {
    setTimeout(function () {
        var currentTimestamp = getTimestamp();
        if(currentTimestamp-lastTimestamp>1000){
            // exit if there isn't request or response in 1000ms
            phantom.exit();
        }
        else{
            checkReadyState();
        }
    }, 100);
}

checkReadyState();

Answer 10

Answered by Rocco Musolino

This is the code I use:

var system = require('system');
var page = require('webpage').create();

page.open('http://....', function(){
      console.log(page.content);
      var k = 0;

      var loop = setInterval(function(){
          var qrcode = page.evaluate(function(s) {
             // Guard against the element not being in the DOM yet
             var img = document.querySelector(s);
             return img ? img.src : null;
          }, '.qrcode img');

          k++;
          if (qrcode){
             console.log('dataURI:', qrcode);
             clearInterval(loop);
             phantom.exit();
          }

          if (k === 50) phantom.exit(); // 10 sec timeout
      }, 200);
  });

Basically, the assumption is that you know the page is fully downloaded once a given element appears in the DOM, so the script waits until that happens.
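The "wait for a value, but give up after a fixed number of tries" pattern used above can be written as a standalone function. This is a sketch under assumptions: waitForValue and its parameters are hypothetical names, and schedule is assumed to behave like setTimeout.

```javascript
// Hypothetical helper (not a PhantomJS API): polls `read` until it returns a
// truthy value or `maxAttempts` reads have been made.
// `onDone` receives (error) on give-up or (null, value) on success.
function waitForValue(read, onDone, intervalMs, maxAttempts, schedule) {
    schedule = schedule || setTimeout;
    var attempts = 0;

    function tick() {
        attempts++;
        var value = read();
        if (value) {
            onDone(null, value);
        } else if (attempts >= maxAttempts) {
            onDone(new Error('gave up after ' + attempts + ' attempts'));
        } else {
            schedule(tick, intervalMs);
        }
    }

    schedule(tick, intervalMs);
}

// Possible use inside the page.open callback (PhantomJS), 50 tries x 200 ms:
// waitForValue(function () {
//     return page.evaluate(function (s) {
//         var img = document.querySelector(s);
//         return img ? img.src : null;
//     }, '.qrcode img');
// }, function (err, src) {
//     console.log(err ? err.message : 'dataURI: ' + src);
//     phantom.exit(err ? 1 : 0);
// }, 200, 50);
```

Reporting the give-up case through the callback lets the caller choose the exit code instead of exiting silently.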
