Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16561582/

Time: 2020-10-27 05:08:48  Source: igfitidea

How to scroll down with Phantomjs to load dynamic content

javascript dom web-scraping screen-scraping phantomjs

Asked by Puneet Saini

I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but have not been able to gather links beyond the first page. Let's say the element at the bottom that loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:

  • Setting viewportSize to a large height right after var page = require('webpage').create();

page.viewportSize = { width: 1600, height: 10000, };

  • Using page.scrollPosition = { top: 10000, left: 0 } inside page.open, which has no effect:
page.open('http://example.com/?q=houston', function(status) {
   if (status == "success") {
      page.scrollPosition = { top: 10000, left: 0 };  
   }
});
  • Also tried putting it inside the page.evaluate function, but that gives

Reference error: Can't find variable page

  • Tried using jQuery and JS code inside page.evaluate and page.open, but to no avail:

$("html, body").animate({ scrollTop: $(document).height() }, 10, function() { //console.log('check for execution'); });

as it is and also inside document.ready. Similarly for JS code-

window.scrollBy(0,10000)

as it is and also inside window.onload

I have been really stuck on this for 2 days now and am not able to find a way. Any help or hint would be appreciated.

Update

I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0

var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };

    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector(".has-more-items") === null;
    });
}

Here .has-more-items is the class of the element I want to access. It is initially available at the bottom of the page and, as we scroll down, moves further down until all data is loaded, and then becomes unavailable.

However, when I tested it, it was clear that it runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 }; with each of the lines below (one at a time):

window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
document.location.href=".has-more-items";

But nothing seems to work.

Accepted answer by João Pesce

Found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check it out. The problem is that you have to wait a little for the page to load, and JavaScript works asynchronously, so you have to use setInterval or setTimeout.

page.open('http://example.com/?q=houston', function () {

  // Checks for bottom div and scrolls down from time to time
  window.setInterval(function() {
      // Checks whether the markup contains an element with the class "has-more-items"
      // (not sure if this is the best way of doing it)
      var count = page.content.match(/class="[^"]*has-more-items[^"]*"/g);

      if(count === null) { // Didn't find
        page.evaluate(function() {
          // Scrolls to the bottom of page
          window.document.body.scrollTop = document.body.scrollHeight;
        });
      }
      else { // Found
        // Do what you want
        ...
        phantom.exit();
      }
  }, 500); // Number of milliseconds to wait between scrolls

});
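
One way to adapt the polling idea above to the situation in the question is to test the marker element in the DOM instead of matching page.content. The sketch below is a hedged variant, not the answer's tested code: scrollUntilMarkerHidden, maxAttempts and delayMs are hypothetical names, and it assumes the page hides .has-more-items with display:none (or removes it) once everything is loaded, as the question describes.

```javascript
// Hypothetical helper (not part of the PhantomJS API): scrolls to the bottom
// repeatedly until the "more items" marker is gone or hidden.
function scrollUntilMarkerHidden(page, selector, options, done) {
  var attempts = 0;

  function step() {
    attempts += 1;
    // This callback runs inside the web page, so it can only use DOM APIs
    // and the arguments passed in after it (here: selector).
    var stillLoading = page.evaluate(function (sel) {
      window.scrollTo(0, document.body.scrollHeight);
      var marker = document.querySelector(sel);
      // Assumption: "unavailable" means removed from the DOM or display:none.
      return marker !== null &&
             window.getComputedStyle(marker).display !== 'none';
    }, selector);

    if (!stillLoading) {
      done(true, attempts);              // marker hidden: everything is loaded
    } else if (attempts >= options.maxAttempts) {
      done(false, attempts);             // give up instead of looping forever
    } else {
      setTimeout(step, options.delayMs); // let the page fetch the next batch
    }
  }

  step();
}

// Only runs under PhantomJS; the URL is the placeholder from the question.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://example.com/?q=houston', function (status) {
    if (status !== 'success') { phantom.exit(1); return; }
    scrollUntilMarkerHidden(page, '.has-more-items',
      { maxAttempts: 40, delayMs: 500 },
      function (finished, attempts) {
        console.log('finished=' + finished + ' after ' + attempts + ' scrolls');
        phantom.exit(finished ? 0 : 1);
      });
  });
}
```

The attempt limit is a design choice: unlike a bare while loop, the script always terminates, even if the marker never disappears.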

Answer by Alexander C. Harrington

I know that it has been answered a long time ago, but I also found a solution to my specific scenario. The result is a piece of javascript that scrolls to the bottom of the page. It is optimized to reduce waiting time.

It is not written for PhantomJS by default, so that will have to be modified. However, for a beginner or someone who doesn't have root access, an Iframe with injected javascript (run Google Chrome with --disable-javascript parameter) is a good alternative method for scraping a smaller set of ajax pages. The main benefit is that it's easily debuggable, because you have a visual overview of what's going on with your scraper.

function ScrollForAjax () {

    scrollintervals = 50;
    scrollmaxtime = 1000;

    if(typeof(scrolltime)=="undefined"){
        scrolltime = 0;
    }

    scrolldocheight1 = $(iframeselector).contents().find("body").height();

    $("body").scrollTop(scrolldocheight1);
    setTimeout(function(){

        scrolldocheight2 = $("body").height();

        if(scrolltime===scrollmaxtime || scrolltime>scrollmaxtime){
            scrolltime = 0;
            $("body").scrollTop(0);
            ScrapeCurrentPage(iframeselector);
        }

        else if(scrolldocheight2>scrolldocheight1){
            scrolltime = 0;
            ScrollForAjax (iframeselector);
        }

        else if(scrolldocheight1>=scrolldocheight2){
            ScrollForAjax (iframeselector);
        }

    },scrollintervals);

    scrolltime += scrollintervals;
}

scrollmaxtime is a timeout variable. Hope this is useful to someone :)
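
Since the snippet above is written for an iframe and jQuery, here is a rough sketch of how the same height-comparison idea might look under PhantomJS. It is an adaptation under assumptions, not this answer's own code: scrollUntilHeightStable, intervalMs and maxIdleMs are hypothetical names that mirror scrollintervals and scrollmaxtime, and the URL is the placeholder from the question.

```javascript
// Hypothetical helper: keeps scrolling while the document keeps growing and
// stops once the height has been stable for options.maxIdleMs milliseconds.
function scrollUntilHeightStable(page, options, done) {
  var lastHeight = 0;
  var idleMs = 0;

  function measureAndScroll() {
    // Runs inside the page: jump to the bottom and report the document height.
    return page.evaluate(function () {
      var height = Math.max(document.body.scrollHeight,
                            document.documentElement.scrollHeight);
      window.scrollTo(0, height);
      return height;
    });
  }

  function tick() {
    var height = measureAndScroll();
    if (height > lastHeight) {
      lastHeight = height;   // new content arrived: reset the idle timer
      idleMs = 0;
    } else {
      idleMs += options.intervalMs;
    }
    if (idleMs >= options.maxIdleMs) {
      done(lastHeight);      // nothing new for long enough: assume we are done
    } else {
      setTimeout(tick, options.intervalMs);
    }
  }

  tick();
}

// Only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://example.com/?q=houston', function (status) {
    if (status !== 'success') { phantom.exit(1); return; }
    scrollUntilHeightStable(page, { intervalMs: 50, maxIdleMs: 1000 },
      function (finalHeight) {
        console.log('final document height: ' + finalHeight);
        phantom.exit();
      });
  });
}
```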

Answer by tfmontague

The "correct" solution didn't work for me. And, from what I've read, CasperJS doesn't use window (but I may be wrong on that), which makes me doubt that window works.

The following works for me in the Firefox/Chrome console, but doesn't work in CasperJS (within the casper.evaluate function).

$(document).scrollTop($(document).height());

What did work for me in CasperJS was:

casper.scrollToBottom();
casper.wait(1000, function waitCb() {
  casper.capture("loadedContent.png");
});

This also worked when moving casper.capture into Casper's then function.

However, the above solution won't work on some sites like Twitter; jQuery seems to break the casper.scrollToBottom() function, and I had to remove the clientScripts reference to jQuery when working with Twitter.

var casper = require('casper').create({
    clientScripts: [
       // 'jquery.js'
    ]
});

Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS scrollToBottom(). Not sure why some sites work and others don't.
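
When a single scroll is not enough, the casper.scrollToBottom() and casper.wait() calls shown earlier in this answer can be repeated until the page stops growing. The following is a minimal sketch under assumptions: scrollUntilNoNewLinks and maxRounds are hypothetical names, and counting a elements is just one possible stop condition, chosen because the question is about collecting links.

```javascript
// Runs inside the page (via casper.evaluate); plain DOM, no jQuery needed.
function countLinks() {
  return document.querySelectorAll('a').length;
}

// Hypothetical helper: scroll, wait, and repeat while the link count grows.
function scrollUntilNoNewLinks(casper, waitMs, maxRounds, done, round) {
  round = round || 1;
  var before = casper.evaluate(countLinks);
  casper.scrollToBottom();
  casper.wait(waitMs, function () {
    var after = casper.evaluate(countLinks);
    if (after > before && round < maxRounds) {
      // New links appeared: try one more scroll.
      scrollUntilNoNewLinks(casper, waitMs, maxRounds, done, round + 1);
    } else {
      // No growth, or the round limit was reached.
      done(after, round);
    }
  });
}

// Usage in a CasperJS script (the URL is the placeholder from the question):
//
// var casper = require('casper').create();
// casper.start('http://example.com/?q=houston', function () {
//   scrollUntilNoNewLinks(casper, 1000, 20, function (total, rounds) {
//     casper.echo(total + ' links after ' + rounds + ' scrolls');
//     casper.capture('loadedContent.png');
//   });
// });
// casper.run();
```

The wait time has to be long enough for the site to fetch a batch; if it is too short, the count will look unchanged and the loop will stop early.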

Answer by Suben Saha

The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it was impossible to find the infinite scroll trigger link. I think the code below will help in scraping other infinite-scroll web pages.

page.open(pageUrl).then(function (status) {
  var count = 0;
  // Scrolls to the bottom of the page until at least 500 items are found
  function scroll2btm() {
    if (count < 500) {
      page.evaluate(function () {
        window.scrollTo(0, document.body.scrollHeight || document.documentElement.scrollHeight);
        // Use the selector of the desired content (e.g. a pin) to count the items present
        return document.getElementsByClassName('pinWrapper').length;
      }).then(function (c) {
        count = c;
        console.log(count); // print the number of items found, as a check
        setTimeout(scroll2btm, 3000); // wait for new content, then scroll again
      });
    } else {
      // required number of items found
    }
  }
  scroll2btm();
});