JavaScript: scraping dynamic page content with PhantomJS
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/13805215/
Scraping dynamic page content phantomjs
Asked by user985590
My company uses a website that hosts all of our FAQ and customer questions. We plan to wipe out all of the old data and enter new data, but the service has no backup or archive option for questions we no longer want to appear.
I tried to scrape the site using Perl and Mechanize, but I'm missing the customer comments on the page because they are loaded through AJAX. I have looked at PhantomJS and can save pages to an image using an example page; however, I'd like to get a full HTML dump of the page and can't figure out how. I used this example code on our site:
    var page = new WebPage();
    page.open('http://espn.go.com/nfl/', function (status) {
        // once page loaded, include jQuery from cdn
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
            // once jQuery loaded, run some code
            // inserts our custom text into the page
            page.evaluate(function () {
                $("h2").html('Many NFL Players Scared that Chad Moon Will Enter League');
            });
            // take screenshot and exit
            page.render('espn.png');
            phantom.exit();
        });
    });
Is there a way to use PhantomJS to get a full page dump of the data, similar to doing a View Source in Chrome? I can do this with Perl + Mechanize, but I don't see how to do it with PhantomJS.
Answered by McMeep
You can use page.content to get the full HTML DOM.
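As a rough illustration of this answer, here is a minimal sketch that reads page.content after the page has loaded and writes it to a file. The helper name dumpHtml, the URL http://example.com/faq, the output file page.html, and the 2000 ms wait are all assumptions made for the example, not part of the original answer. The fixed delay is only a crude way to give AJAX-loaded comments time to arrive and would likely need tuning for a real site.

```javascript
// Hypothetical helper (not from the original answer): opens `url` in `page`
// and passes the rendered HTML to `done(err, html)`.
// `delayMs` is an assumed wait for AJAX content; tune it for the real site.
function dumpHtml(page, url, delayMs, done) {
    page.open(url, function (status) {
        if (status !== 'success') {
            done(new Error('Failed to load ' + url), null);
            return;
        }
        // Give AJAX-loaded content time to arrive before reading the DOM.
        setTimeout(function () {
            // page.content holds the current HTML of the page as a string.
            done(null, page.content);
        }, delayMs);
    });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var fs = require('fs');
    // The URL and file name below are placeholders.
    dumpHtml(page, 'http://example.com/faq', 2000, function (err, html) {
        if (err) {
            console.log(err.message);
            phantom.exit(1);
            return;
        }
        fs.write('page.html', html, 'w');
        phantom.exit();
    });
}
```

Saved as, say, dump.js, this would be run with `phantomjs dump.js`. Unlike View Source in Chrome, page.content reflects the DOM after scripts have run, which is what makes the AJAX-loaded comments visible.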
Answered by Radhouane Fazai
I would recommend pjscrape (http://nrabinowitz.github.com/pjscrape/) if you want to scrape using PhantomJS.