JavaScript: How to scroll down with PhantomJS to load dynamic content
Source: http://stackoverflow.com/questions/16561582/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Puneet Saini
I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display: none). Here are the things I have tried:
- Setting viewportSize to a large height right after var page = require('webpage').create();
page.viewportSize = { width: 1600, height: 10000 };
- Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, but it has no effect:
page.open('http://example.com/?q=houston', function(status) {
    if (status == "success") {
        page.scrollPosition = { top: 10000, left: 0 };
    }
});
- Also tried putting it inside the page.evaluate function, but that gives:
Reference error: Can't find variable page
- Tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:
$("html, body").animate({ scrollTop: $(document).height() }, 10, function() {
    //console.log('check for execution');
});
as it is and also inside document.ready. Similarly for JS code:
window.scrollBy(0,10000)
as it is and also inside window.onload.
I have been stuck on this for 2 days now and have not been able to find a way. Any help or hint would be appreciated.
Update
I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0
var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector(".has-more-items") === null;
    });
}
Here .has-more-items is the class of the element I want to access. It is initially available at the bottom of the page; as we scroll down, it moves further down until all data is loaded and then becomes unavailable.
However, when I tested it, it clearly runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the lines below (one at a time):
window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href=".has-more-items";
But nothing seems to work.
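One detail in the loop above may explain why it never scrolls: page.scrollPosition is an object with top and left fields, so page.scrollPosition + 1000 produces a string rather than a larger offset. The following is a minimal sketch of the arithmetic, assuming a hypothetical helper named scrolledDownBy (it is not part of PhantomJS):

```javascript
// Hypothetical helper (not a PhantomJS API): returns a new position object
// moved down by `delta` pixels, leaving the input object untouched.
function scrolledDownBy(position, delta) {
    return { top: position.top + delta, left: position.left };
}

// The expression used in the loop coerces the object to a string,
// so no numeric offset is ever produced.
var broken = { top: 0, left: 0 } + 1000;

// Intended use inside a PhantomJS script (assumption, shown as a comment
// because `page` only exists there):
//   page.scrollPosition = scrolledDownBy(page.scrollPosition, 1000);
```

Even with the arithmetic fixed, a tight while loop likely leaves the page no time to load new items between iterations, so some form of timer is probably needed as well.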
Accepted answer by João Pesce
I found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check it out. The problem is that you have to wait a little for the page to load, and JavaScript works asynchronously, so you have to use setInterval or setTimeout.
page.open('http://example.com/?q=houston', function () {
    // Checks for bottom div and scrolls down from time to time
    window.setInterval(function() {
        // Checks if there is a div with class=".has-more-items"
        // (not sure if this is the best way of doing it)
        var count = page.content.match(/class=".has-more-items"/g);
        if(count === null) { // Didn't find
            page.evaluate(function() {
                // Scrolls to the bottom of page
                window.document.body.scrollTop = document.body.scrollHeight;
            });
        }
        else { // Found
            // Do what you want
            ...
            phantom.exit();
        }
    }, 500); // Number of milliseconds to wait between scrolls
});
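The check above searches page.content for the marker, whereas the question describes a marker that stays in the DOM and is hidden with display: none once everything is loaded. A variant that tests visibility inside the page might look like the sketch below. The names scrollStep and scrollUntilLoaded are hypothetical, and page is assumed to be a PhantomJS WebPage-like object whose evaluate(fn) runs fn inside the loaded page and returns its result.

```javascript
// One polling step: scrolls to the bottom while the marker is still visible.
// Returns true if more content is expected, false once the marker is gone.
function scrollStep(page) {
    return page.evaluate(function () {
        var marker = document.querySelector('.has-more-items');
        // Treat a missing or display:none marker as "everything is loaded".
        var more = marker !== null &&
            window.getComputedStyle(marker).display !== 'none';
        if (more) {
            window.scrollTo(0, document.body.scrollHeight);
        }
        return more;
    });
}

// Hypothetical wrapper: polls every intervalMs milliseconds until the marker
// disappears, then stops the timer and calls onDone.
function scrollUntilLoaded(page, intervalMs, onDone) {
    var timer = setInterval(function () {
        if (!scrollStep(page)) {
            clearInterval(timer);
            onDone();
        }
    }, intervalMs);
}
```

Inside the page.open callback this could be started with scrollUntilLoaded(page, 500, function () { /* collect links here */ phantom.exit(); }). The 500 ms interval is taken from the answer above; sites that load slowly may need a longer one.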
Answer by Alexander C. Harrington
I know this was answered a long time ago, but I also found a solution for my specific scenario. The result is a piece of JavaScript that scrolls to the bottom of the page. It is optimized to reduce waiting time.
It is not written for PhantomJS by default, so it will have to be modified. However, for a beginner or someone who doesn't have root access, an iframe with injected JavaScript (run Google Chrome with the --disable-javascript parameter) is a good alternative method for scraping a smaller set of AJAX pages. The main benefit is that it's easily debuggable, because you have a visual overview of what's going on with your scraper.
function ScrollForAjax (iframeselector) {
    var scrollintervals = 50;   // milliseconds between height checks
    var scrollmaxtime = 1000;   // give up after this long without new content
    // scrolltime stays global so that it persists across recursive calls
    if (typeof(scrolltime) == "undefined") {
        scrolltime = 0;
    }
    var scrolldocheight1 = $(iframeselector).contents().find("body").height();
    $("body").scrollTop(scrolldocheight1);
    setTimeout(function(){
        var scrolldocheight2 = $("body").height();
        if (scrolltime >= scrollmaxtime) {
            // Timed out: assume everything is loaded and scrape
            scrolltime = 0;
            $("body").scrollTop(0);
            ScrapeCurrentPage(iframeselector);
        }
        else if (scrolldocheight2 > scrolldocheight1) {
            // The page grew: reset the timer and scroll again
            scrolltime = 0;
            ScrollForAjax(iframeselector);
        }
        else {
            // No growth yet: keep waiting
            ScrollForAjax(iframeselector);
        }
    }, scrollintervals);
    scrolltime += scrollintervals;
}
scrollmaxtime is a timeout variable. Hope this is useful to someone :)
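The decision logic in the function above can be separated from jQuery and the timers. The sketch below is a restatement under assumptions, not the author's code: nextAction and pollsUntilScrape are hypothetical names, and the page heights are replayed from an array instead of being measured from a live page.

```javascript
// Decide what to do after one height check.
// waitedMs is the time spent since the page height last grew.
function nextAction(prevHeight, newHeight, waitedMs, maxWaitMs) {
    if (waitedMs >= maxWaitMs) {
        return 'scrape';   // timed out: assume all content is loaded
    }
    if (newHeight > prevHeight) {
        return 'reset';    // new content arrived: restart the timer
    }
    return 'wait';         // nothing new yet: keep polling
}

// Replays a list of recorded heights and returns the index of the poll at
// which scraping would start, or -1 if the timeout is never reached.
function pollsUntilScrape(heights, intervalMs, maxWaitMs) {
    var waited = 0;
    var prev = heights[0];
    for (var i = 1; i < heights.length; i++) {
        var action = nextAction(prev, heights[i], waited, maxWaitMs);
        if (action === 'scrape') {
            return i;
        }
        waited = (action === 'reset') ? 0 : waited + intervalMs;
        prev = heights[i];
    }
    return -1;
}
```

As in the original function, the timeout is checked before growth, so a page that keeps growing resets the timer and is never scraped early. A short interval with a longer timeout keeps waiting time low on fast pages while still tolerating slow responses.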
Answer by tfmontague
The "correct" solution didn't work for me. And, from what I've read, CasperJS doesn't use window (but I may be wrong on that), which makes me doubt that window works.
The following works for me in the Firefox/Chrome console, but doesn't work in CasperJS (within the casper.evaluate function).
$(document).scrollTop($(document).height());
What did work for me in CasperJS was:
casper.scrollToBottom();
casper.wait(1000, function waitCb() {
    casper.capture("loadedContent.png");
});
This also worked when moving casper.capture into Casper's then function.
However, the above solution won't work on some sites like Twitter; jQuery seems to break the casper.scrollToBottom() function, and I had to remove the clientScripts reference to jQuery when working with Twitter.
var casper = require('casper').create({
    clientScripts: [
        // 'jquery.js'
    ]
});
Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS scrollToBottom(). I'm not sure why some sites work and others don't.
Answer by Suben Saha
The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it was impossible to find the infinite-scroll trigger link. I think the code below will help with scraping other infinite-scroll web pages.
page.open(pageUrl).then(function (status) {
    var count = 0;
    // Scrolls to the bottom of the page until enough items are found
    function scroll2btm() {
        if (count < 500) {
            page.evaluate(function() {
                window.scrollTo(0, document.body.scrollHeight || document.documentElement.scrollHeight);
                // Use the selector of the desired content (e.g. a pin) to count items
                return document.getElementsByClassName('pinWrapper').length;
            }).then(function(c) {
                count = c;
                console.log(count); // print the number of items found so far
            });
            setTimeout(scroll2btm, 3000);
        }
        else {
            // Required number of items found
        }
    }
    scroll2btm();
});