Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/24069722/

Date: 2020-10-28 02:07:27  Source: igfitidea

Can I get the original page source (vs current DOM) with phantomjs/casperjs?

Tags: javascript, phantomjs, casperjs

Asked by supercoco

I am trying to get the original source for a particular web page.

The page executes some scripts that modify the DOM as soon as it loads. I would like to get the source before any script or user changes any object in the document.

With Chrome or Firefox (and probably most browsers) I can either look at the DOM (debug utility F12) or look at the original source (right-click, view source). The latter is what I want to accomplish.

Is it possible to do this with phantomjs/casperjs?

Before getting to the page I have to log in. This is working fine with casperjs. If I browse to the page and render the results I know I am on the right page.

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});

I've tried this.download(url, 'a.html'), but it doesn't seem to share the same context: it returns the HTML as if I were not logged in, even when I run with a cookies file (casperjs test.casper.js --cookies-file=cookies.txt).

I believe I should keep analyzing this option.

I have also tried casper.open('view-source:url') instead of casper.open('http://url'), but it doesn't seem to recognize the URL, since I just get a blank page.

Using a separate utility, I have looked at the raw HTTP response I get from the server. The body of this message (which is HTML) is what I need, but by the time the page loads in the browser, the DOM has already been modified.
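
A possible way to get at that raw body from within the logged-in session is to issue a second request from the page context, so that it shares the page's cookies. The sketch below assumes CasperJS's client-side `__utils__.sendAJAX` helper is available in the page and that the URL is same-origin (or that `webSecurityEnabled` is false, as in the full script). `fetchRawSource` is a hypothetical helper name, and because this is a separate request, the server may not return exactly the same bytes as the original page load.

```javascript
// Hypothetical helper: fetch the unmodified response body of `url`
// from inside the page context, so the request reuses the page's cookies.
// `casper` is assumed to be a CasperJS instance that is already logged in.
function fetchRawSource(casper, url) {
    return casper.evaluate(function (targetUrl) {
        // __utils__ is CasperJS's client-side helper object. The fourth
        // argument (false) asks for a synchronous request, so the
        // response text is returned directly.
        return __utils__.sendAJAX(targetUrl, 'GET', null, false);
    }, url);
}
```

Inside a step this might be called as `this.echo(fetchRawSource(this, 'http://' + customUrl));`.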

I tried:

casper.thenOpen('http://'+url, function(response) {
    ...
});

But the response object only contains the headers and some other information, not the body.

I also tried with the event onResourceRequested.

The idea is to abort the download of any resource needed by a specific web page (the referer).

onResourceRequested: function(casperObj, requestData, networkRequest) {
    // Abort every request whose Referer header is the target page
    for (var i = 0; i < requestData.headers.length; i++) {
        var obj = requestData.headers[i];
        if (obj.name === "Referer" && obj.value === 'http://' + customUrl) {
            networkRequest.abort();
            break;
        }
    }
}

Unfortunately, the script that initially modifies the DOM seems to be inline in the main HTML page (or this code is not doing what I would like it to do).

Any ideas?

Here is the full code:

phantom.casperTest = true;
phantom.cookiesEnabled = true;

var utils = require('utils');
var casper = require('casper').create({
    clientScripts:  [],
    pageSettings: {
        loadImages:  false,
        loadPlugins: false,
        javascriptEnabled: true,
        webSecurityEnabled: false
    },
    logLevel: "error",
    verbose: true
});

casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');

casper.start('http://www.xxxxxxx.xxx/login');

casper.waitForSelector('input#login',
    function() {
        this.evaluate(function(customLogin, customPassword) {
            document.getElementById("login").value = customLogin;
            document.getElementById("password").value = customPassword;
            document.getElementById("button").click();
        }, {
            "customLogin": customLogin,
            "customPassword": customPassword
        });
    },
    function() {
        console.log("Can't login.");
    },
    15000
);

casper.waitForSelector('div#home',
    function() {
        console.log('Login successful.');
    },
    function() {
        console.log('Login failed.');
    },
    15000
);

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});

Answered by Fanch

Hmm, did you try using some events? For example:

casper.on('load.started', function(resource) {
    casper.echo(casper.getPageContent());
});

I don't think it will work, but try it anyway.

The problem is that you can't do it in a normal CasperJS step, because by then the scripts on your page have already executed. It could work if we could bind to the DOM-ready event, or if Casper had a specific event for it. But the page must be loaded before Casper can send any JS to the DOM environment, so binding to DOM-ready isn't possible (at least I don't see how). I think with Phantom we can only scrape data after the load event, that is, once the page is rendered.

So if it's not possible to hack it with the events and maybe some delay, your only solution is to block the scripts which modify your DOM.

There is still the PhantomJS option, which you can set in Casper:

casper.pageSettings.javascriptEnabled = false;

The problem is that you need JS enabled to get the data back, so it can't work... :p Yeah, useless comment! :)

Otherwise you have to use events to block the resource/script that modifies the DOM.

Or you could use the resource.received event to scrape the data you want before the specific resources that modify the DOM appear.

In fact, I don't think it's possible: if you create a step that gets some data back from the page just before specific resources appear, those resources will already have loaded by the time your step is executed. It would be necessary to freeze the subsequent resources while your step is scraping the data.

I don't know how to do that, but these events could help you:

casper.on('resource.requested', function(request) {
    console.log(" request " + request.url);
});

casper.on('resource.received', function(resource) {
    console.log(resource.url);
});

casper.on('resource.error',function (request) {
    this.echo('[res : id and url + error description] <-- ' + request.id + ' ' + request.url + ' ' + request.errorString);
});

See also "How do you Disable css in CasperJS?". The solution that would work: identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and Phantom easily permit that. The only useful option on a request is abort(); it would be nice to have something like timeout("time -> ms") as well!

onResourceRequested
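
The "identify and block" idea could be sketched as follows. It assumes the DOM-modifying code lives in external script files whose URLs you can match; inline scripts in the main HTML cannot be blocked this way. `makeScriptBlocker` and the example pattern are hypothetical; the handler signature `(requestData, request)` follows CasperJS 1.1's `resource.requested` event.

```javascript
// Build a handler for CasperJS's 'resource.requested' event that aborts
// every request whose URL matches one of the given patterns.
// The patterns are placeholders: replace them with the scripts you identified.
function makeScriptBlocker(patterns) {
    return function (requestData, request) {
        for (var i = 0; i < patterns.length; i++) {
            if (patterns[i].test(requestData.url)) {
                request.abort();
                return;
            }
        }
    };
}

// Hypothetical usage inside a CasperJS script:
// casper.on('resource.requested', makeScriptBlocker([/dom-modifier\.js$/]));
```

Requests that match no pattern are left alone, so the rest of the page still loads normally.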

Here is a similar question: Injecting script before other

Answered by Artjom B.

As Fanch pointed out, it seems it's not possible to do this. If you are able to do two requests, then this gets easy. Simply do one request with JavaScript enabled and one without, so you can scrape the page source and compare it.

casper
    .then(function(){
        this.options.pageSettings.javascriptEnabled = false;
    })
    .thenOpen(url, function(){
        this.echo("before JavaScript");
        this.echo(this.getHTML());
    })
    .then(function(){
        this.options.pageSettings.javascriptEnabled = true;
    })
    .thenOpen(url, function(){
        this.echo("after JavaScript");
        this.echo(this.getHTML());
    });

You can change the order according to your needs. If you're already on a page whose original markup you want, you can use casper.getCurrentUrl() to get the current URL:

casper
    .then(function(){
        // submit or whatever
    })
    .thenOpen(url, function(){
        this.echo("after JavaScript");
        this.echo(this.getHTML());
        this.options.pageSettings.javascriptEnabled = false;

        this.thenOpen(this.getCurrentUrl(), function(){
            this.echo("before JavaScript");
            this.echo(this.getHTML());
        })
    });
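
The comparison step mentioned at the start of this answer is not shown above. A minimal sketch, assuming you stored the two `getHTML()` results in variables: `addedLines` is a hypothetical helper that does a rough line-based comparison, not a real diff.

```javascript
// Return the lines of `after` that do not appear anywhere in `before`.
// Lines are compared after trimming whitespace; empty lines are ignored.
function addedLines(before, after) {
    var seen = {};
    before.split('\n').forEach(function (line) {
        // Prefix keys with '$' so lines like "constructor" cannot
        // collide with built-in object properties.
        seen['$' + line.trim()] = true;
    });
    return after.split('\n').filter(function (line) {
        var key = line.trim();
        return key !== '' && !seen['$' + key];
    });
}
```

With hypothetical variables `beforeHtml` and `afterHtml` holding the two sources, `addedLines(beforeHtml, afterHtml)` would list the lines that only exist once JavaScript has run.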

Answered by the binary

According to the docs, you can use #debugPage() to get the content of the current page.

casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');

casper.start('http://www.xxxxxxx.xxx/login');

casper.waitForSelector('input#login', ... );

casper.then(function() {
  this.debugHTML();
});

casper.run();

Regards, David