javascript: Save the HTML output of a page after its JavaScript has executed
Original question: http://stackoverflow.com/questions/16856036/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
save html output of page after execution of the page's javascript
Asked by gyaani_guy
There is a site I am trying to scrape that first loads HTML/JS, modifies the form input fields using JS, and then POSTs. How can I get the final HTML output of the POSTed page?
I tried to do this with PhantomJS, but it seems to only have an option to render image files. Googling around suggests it should be possible, but I can't figure out how. My attempt:
var page = require('webpage').create();
var fs = require('fs');
page.open('https://www.somesite.com/page.aspx', function () {
    page.evaluate(function(){
    });
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
});
This code will be used for a client, and I can't expect him to install too many packages (Node.js, CasperJS, etc.).
Thanks
Answered by uffa
The output code you have is correct, but there is an issue with synchronicity. The output lines are being executed before the page has finished loading. You can tie into the onLoadFinished callback to find out when that happens. See the full code below.
var page = new WebPage();
var fs = require('fs');
page.onLoadFinished = function() {
    console.log("page load finished");
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
};
page.open("http://www.google.com", function() {
    page.evaluate(function() {
    });
});
Testing against a site like Google can be deceiving: it loads so quickly that you can often execute a screen grab inline, the way you have it. Timing is a tricky thing in PhantomJS; sometimes I test with setTimeout to see if timing is the issue.
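As a rough illustration of the setTimeout test mentioned above, the save can be deferred by a fixed delay. This is only a sketch: the helper name saveHtmlAfterDelay, the output path, and the 2-second delay are assumptions, not part of the original answer. A fixed delay is a way to diagnose a timing problem rather than a reliable fix.

```javascript
// Hypothetical helper: wait delayMs, then write the page's current HTML.
// `page` is expected to be a PhantomJS webpage object and `fs` the
// PhantomJS fs module; both are passed in so the helper has no globals.
function saveHtmlAfterDelay(page, fs, outputPath, delayMs, done) {
    setTimeout(function () {
        // page.content is read only after the delay, so it reflects
        // whatever the page's scripts have changed by then.
        fs.write(outputPath, page.content, 'w');
        done();
    }, delayMs);
}

// Possible use inside a PhantomJS script (assumed file name and delay):
// page.open(url, function (status) {
//     saveHtmlAfterDelay(page, require('fs'), '1.html', 2000, function () {
//         phantom.exit();
//     });
// });
```

If the saved file only looks right with the delay in place, the original problem was timing, and a condition-based wait is likely the better long-term choice.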
Answered by Owen Martin
When I copied your code directly, and changed the URL to www.google.com, it worked fine, with two files saved:
- 1.html
- export.png
Bear in mind that the files will be written to the directory you run the script from, not the directory where your .js file is located.
Answered by Heitor
After two long days of struggle and frustration, I finally got my similar issue solved. What did the trick was the waitfor.js example on PhantomJS's official website. Be happy!
"use strict";
function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timeout is 3s
        start = new Date().getTime(),
        condition = false,
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
};
var page = require('webpage').create();
// Open Twitter on 'sencha' profile and, onPageLoad, do...
page.open("http://twitter.com/#!/sencha", function (status) {
    // Check for page load success
    if (status !== "success") {
        console.log("Unable to access network");
    } else {
        // Wait for 'signin-dropdown' to be visible
        waitFor(function() {
            // Check in the page if a specific element is now visible
            return page.evaluate(function() {
                return $("#signin-dropdown").is(":visible");
            });
        }, function() {
            console.log("The sign-in dialog should be visible now.");
            phantom.exit();
        });
    }
});
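The example above only logs a message once the element appears. To save the HTML, as the question asks, the ready callback could write page.content instead. Below is a minimal sketch of the same polling idea with the exit decision left to the caller. The name pollUntil, its parameter order, and the timeout and interval values in the usage comment are assumptions, not part of the official waitfor.js example.

```javascript
// Hypothetical polling helper: call testFx every intervalMs until it
// returns a truthy value (then call onReady) or until timeoutMs has
// elapsed (then call onTimeout). Exactly one of the callbacks runs once.
function pollUntil(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Possible use inside a PhantomJS script (selector taken from the
// example above, file name from the question):
// pollUntil(function () {
//     return page.evaluate(function () {
//         return $("#signin-dropdown").is(":visible");
//     });
// }, function () {
//     require('fs').write('1.html', page.content, 'w');
//     phantom.exit();
// }, function () {
//     console.log("timed out waiting for the page");
//     phantom.exit(1);
// }, 3000, 250);
```

Passing the timeout handler in explicitly keeps the helper free of phantom.exit, so the same function can be exercised outside PhantomJS.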
Answered by strah
Answered by Ben Hutchison
I'm using CasperJS to run tests with PhantomJS. I added this code to my tearDown function:
var require = patchRequire(require);
var fs = require('fs');
casper.test.begin("My Test", {
    tearDown: function(){
        casper.capture("export.png");
        fs.write("1.html", casper.getHTML(undefined, true), 'w');
    },
    test: function(test){
        // test code
        casper.run(function(){
            test.done();
        });
    }
});
See the docs for capture and getHTML.
Answered by Dropout
One approach that comes to mind, besides using a headless browser, is to simulate the AJAX calls yourself and assemble the post-processed page request by request. However, this is often tricky and should be used as a last resort, unless you really like digging through JavaScript code.
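One small piece of that approach is rebuilding the request body that the page's JavaScript would have POSTed. The sketch below encodes a set of form fields as application/x-www-form-urlencoded. The helper names encodeForm and encodeComponent and the field names are hypothetical; the real field names and values would have to be read from the site's own scripts.

```javascript
// Hypothetical helper: percent-encode one name or value, using '+' for
// spaces as HTML form submissions conventionally do.
function encodeComponent(value) {
    return encodeURIComponent(String(value)).replace(/%20/g, '+');
}

// Hypothetical helper: turn an object of form fields into a
// URL-encoded request body, keeping the fields in insertion order.
function encodeForm(fields) {
    return Object.keys(fields).map(function (name) {
        return encodeComponent(name) + '=' + encodeComponent(fields[name]);
    }).join('&');
}

// Made-up field names, for illustration only.
var body = encodeForm({ user: 'alice', note: 'a b&c' });
console.log(body); // user=alice&note=a+b%26c
```

The resulting string would be sent as the body of the POST with any HTTP client, which is where the request-by-request digging this answer warns about begins.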
Answered by Sem Voigtländer
This can easily be done with some PHP code and JavaScript. Use fopen() and fwrite() on the PHP side, and this statement to capture the generated source so it can be saved: var generatedSource = new XMLSerializer().serializeToString(document);