javascript: Save a page's HTML output after executing the page's JavaScript

Question

Asked by gyaani_guy

There is a site I am trying to scrape that first loads an HTML/JS page, modifies the form input fields using JS, and then POSTs the form. How can I get the final HTML output of the POSTed page?

I tried to do this with PhantomJS, but it seems to only have an option to render image files. Googling around suggests it should be possible, but I can't figure out how. My attempt:

var page = require('webpage').create();
var fs = require('fs');
page.open('https://www.somesite.com/page.aspx', function () {
    page.evaluate(function(){

    });

    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
});

This code will be used by a client, and I can't expect him to install too many packages (nodejs, casperjs, etc.).

Thanks

Answer 1

Answered by uffa

The output code you have is correct, but there is a timing issue: the output lines are being executed before the page is done loading. You can hook into the onLoadFinished callback to find out when loading completes. See the full code below.

    var page = new WebPage()
    var fs = require('fs');

    page.onLoadFinished = function() {
      console.log("page load finished");
      page.render('export.png');
      fs.write('1.html', page.content, 'w');
      phantom.exit();
    };

    page.open("http://www.google.com", function() {
      page.evaluate(function() {
      });
    });

Testing against a site like Google can be deceiving: it loads so quickly that you can often take the screen grab inline, as in your code, and it still works. Timing is a tricky thing in PhantomJS; sometimes I test with setTimeout to see whether timing is the issue.
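
For the page in the question, the form POST presumably causes a second page load, so a handler that saves on the first onLoadFinished call would capture the form page rather than the result. The following is a minimal sketch under that assumption (exactly two successful top-level loads, which the question does not confirm); the helper name afterNthLoad is hypothetical, and the PhantomJS part only runs when the phantom global exists.

```javascript
// Hypothetical helper: builds an onLoadFinished handler that calls
// `onTarget` on the n-th successful load and `onFail` on any failed load.
function afterNthLoad(n, onTarget, onFail) {
    var loads = 0;
    return function (status) {
        if (status !== 'success') {
            onFail(status);
            return;
        }
        loads += 1;
        if (loads === n) {
            onTarget();
        }
    };
}

// PhantomJS-only part; skipped when the script runs in another JS engine.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var fs = require('fs');

    // Assumption: load 1 is the form page, load 2 is the POSTed result.
    page.onLoadFinished = afterNthLoad(2, function () {
        fs.write('1.html', page.content, 'w');
        phantom.exit();
    }, function (status) {
        console.log('Load failed: ' + status);
        phantom.exit(1);
    });

    page.open('https://www.somesite.com/page.aspx');
}
```

If the site performs extra redirects or loads, the count of 2 would need adjusting; logging each call to the handler is a quick way to find out.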

Answer 2

Answered by Owen Martin

When I copied your code directly, and changed the URL to www.google.com, it worked fine, with two files saved:

1.html
export.png

Bear in mind that the files will be written to the directory you run the script from, not the directory where your .js file is located.
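
If the files should land next to the script instead, one option is to build an absolute output path. This is a rough sketch, not something from the answer: it assumes phantom.libraryPath still holds its initial value (the directory of the invoked script), and joinPath is a hypothetical helper.

```javascript
// Hypothetical helper: joins a directory and a file name with the given
// separator, tolerating one trailing separator on the directory.
function joinPath(dir, name, sep) {
    var trimmed = dir.charAt(dir.length - 1) === sep ? dir.slice(0, -1) : dir;
    return trimmed + sep + name;
}

// PhantomJS-only part; skipped when the script runs in another JS engine.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var page = require('webpage').create();

    // Assumption: libraryPath has not been reassigned, so it is the
    // script's own directory rather than the current working directory.
    var outFile = joinPath(phantom.libraryPath, '1.html', fs.separator);

    page.open('http://www.google.com', function () {
        fs.write(outFile, page.content, 'w');
        phantom.exit();
    });
}
```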

Answer 3

Answered by Heitor

After two long days of struggling and frustration, I finally got my similar issue solved. What did the trick was the waitfor.js example on PhantomJS's official website. Be happy!

"use strict";

function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timout is 3s
        start = new Date().getTime(),
        condition = false,
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
};


var page = require('webpage').create();

// Open Twitter on 'sencha' profile and, onPageLoad, do...
page.open("http://twitter.com/#!/sencha", function (status) {
    // Check for page load success
    if (status !== "success") {
        console.log("Unable to access network");
    } else {
        // Wait for 'signin-dropdown' to be visible
        waitFor(function() {
            // Check in the page if a specific element is now visible
            return page.evaluate(function() {
                return $("#signin-dropdown").is(":visible");
            });
        }, function() {
           console.log("The sign-in dialog should be visible now.");
           phantom.exit();
        });
    }
});
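
The example above only logs a message once the condition holds. Applied to the original question, the same polling idea could be used to write page.content once the POSTed page shows up. The sketch below is a simplified variant of waitFor (it takes an explicit onTimeout callback instead of calling phantom.exit itself), and the selector #result is purely hypothetical: substitute an element that exists only on the final page.

```javascript
// Simplified polling helper: checks `testFx` every 250 ms, calls `onReady`
// once it returns true, or `onTimeout` if `timeoutMs` elapses first.
function waitFor(testFx, onReady, onTimeout, timeoutMs) {
    var limit = timeoutMs || 3000;
    var start = Date.now();
    var interval = setInterval(function () {
        if (testFx()) {
            clearInterval(interval);
            onReady();
        } else if (Date.now() - start >= limit) {
            clearInterval(interval);
            onTimeout();
        }
    }, 250);
}

// PhantomJS-only part; skipped when the script runs in another JS engine.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var fs = require('fs');

    page.open('https://www.somesite.com/page.aspx', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the page');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Assumption: '#result' appears only on the POSTed page.
            return page.evaluate(function () {
                return document.querySelector('#result') !== null;
            });
        }, function () {
            fs.write('1.html', page.content, 'w');
            phantom.exit();
        }, function () {
            console.log('Timed out waiting for #result');
            phantom.exit(1);
        }, 10000);
    });
}
```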

Answer 4

Answered by strah

I tried several approaches to a similar task, and I got the best results using Selenium.

Before that, I tried PhantomJS and Cheerio. PhantomJS crashed too often while executing JS on the page.

Answer 5

Answered by Ben Hutchison

I'm using CasperJS to run tests with PhantomJS. I added this code to my tearDown function:

var require = patchRequire(require);
var fs = require('fs');

casper.test.begin("My Test", {
    tearDown: function(){
        casper.capture("export.png");
        fs.write("1.html", casper.getHTML(undefined, true), 'w');
    },
    test: function(test){
        // test code

        casper.run(function(){
            test.done();
        });
    }
});

See the docs for capture and getHTML.

Answer 6

Answered by Dropout

One approach that comes to mind, besides using a headless browser, is to simulate the AJAX calls yourself and assemble the final page request by request. This is often tricky, though, and should be a last resort unless you really like digging through JavaScript code.
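
As a rough illustration of what simulating the request by hand involves, the sketch below builds a form-encoded POST body from a set of field values. The field names and values are hypothetical; the real ones would have to be read out of the site's form and its JavaScript, which is the tricky part. The encoding is an approximation of application/x-www-form-urlencoded based on encodeURIComponent.

```javascript
// Builds an (approximately) application/x-www-form-urlencoded body
// from a plain object of field names and values.
function encodeForm(fields) {
    return Object.keys(fields).map(function (key) {
        return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
    }).join('&').replace(/%20/g, '+'); // form encoding uses '+' for spaces
}

// Hypothetical fields, standing in for whatever the page's JS computes.
var body = encodeForm({ query: 'a b', token: 'x&y=z' });
console.log(body);
```

The resulting string would then be sent as the body of a POST request with the Content-Type header set to application/x-www-form-urlencoded, using whatever HTTP client is available.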

Answer 7

Answered by Sem Voigtländer

This can easily be done with some PHP code and JavaScript. Use fopen() and fwrite() on the PHP side to save the file, and this JavaScript to get the generated source: var generatedSource = new XMLSerializer().serializeToString(document);
