Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27472057/

Date: 2020-10-28 07:31:32  Source: igfitidea

Web scraping using PhantomJS

Tags: javascript, web-scraping, phantomjs

Asked by Trancey

Is there a way to execute all the JavaScript in a web page exactly as a browser does, without specifying which function to execute? Most of the examples I have seen specify which portion of the scraped page's JavaScript to execute. I need to scrape the whole page, execute all of its JavaScript just as a browser would, and get the final rendered markup, the same thing you see in the browser's "Inspect" view.

I am sure there must be some way, but the example code from PhantomJS did not seem to have any example addressing this.

Accepted answer by Artjom B.

You don't specify what gets executed from the page with PhantomJS. You open the page with PhantomJS and all JavaScript that is executed in Chrome or Firefox is also executed in PhantomJS. It is a full browser without a "head".

There are some differences though. Clicking a download link will not trigger a download. The rendering engine which PhantomJS 1.x is based upon is nearly 4 years old, so some pages are simply rendered differently, because PhantomJS 1.x might not support that feature. (PhantomJS 2 is on the way and now in unofficial "alpha" status)

So you need to script every interaction that a user would perform on the page, using JavaScript or CoffeeScript. You don't call page functions. You manipulate DOM elements to simulate a user interacting with the page in the browser. This has to be done in such a crude way because the PhantomJS API doesn't provide high-level user-like functions. If you want those, have a look at CasperJS, which is built on top of PhantomJS/SlimerJS.

There you actually have functions like click, wait, fetchText, etc.
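
As an illustration of the "crude" plain-PhantomJS approach described above, here is a minimal sketch that simulates a user click by dispatching a synthetic mouse event on a DOM element. The helper name clickElement, the selector "#load-more", and the 2-second wait are assumptions made for this example; they do not come from the answer. The PhantomJS part is guarded so that the helper can also be loaded and checked outside PhantomJS.

```javascript
// Usage (hypothetical file name): phantomjs click.js http://your.url.to.scrape.com
"use strict";

// Runs inside the page context (via page.evaluate). Dispatches a synthetic
// "click" mouse event on the first element matching `selector`.
// Returns true if an element was found, false otherwise.
function clickElement(selector) {
    var el = document.querySelector(selector);
    if (!el) {
        return false;
    }
    var ev = document.createEvent("MouseEvent");
    ev.initMouseEvent("click", true, true, window, 0, 0, 0, 0, 0,
                      false, false, false, false, 0, null);
    el.dispatchEvent(ev);
    return true;
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== "undefined") {
    var system = require("system");
    var page = require("webpage").create();

    page.open(system.args[1], function (status) {
        if (status !== "success") {
            console.log("Failed to load " + system.args[1]);
            phantom.exit(1);
            return;
        }
        // "#load-more" is a hypothetical selector; replace it with a real one.
        var clicked = page.evaluate(clickElement, "#load-more");
        console.log("clicked: " + clicked);

        // Assumption: 2 seconds is enough for the page's own click handlers
        // to update the DOM before the markup is dumped.
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 2000);
    });
}
```

The function passed to page.evaluate is serialized and run inside the page, so it can only use its own arguments and the page's globals, not variables from the outer script.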

Answer by hqm

This will work: put this in a file named "scrape.js" and run it with phantomjs, passing your URL as the first argument.

// Usage: phantomjs scrape.js http://your.url.to.scrape.com
"use strict";
var sys = require("system"),
    page = require("webpage").create(),
    url = sys.args[1];

// Print each argument on its own line, JSON-encoded.
function printArgs() {
    var i, ilen;
    for (i = 0, ilen = arguments.length; i < ilen; ++i) {
        console.log("    arguments[" + i + "] = " + JSON.stringify(arguments[i]));
    }
    console.log("");
}

// Forward the page's own console output. Do not exit here: a page may log
// something before it has finished loading.
page.onConsoleMessage = function () {
    printArgs.apply(this, arguments);
};

// Once the page has loaded and its scripts have run, dump the resulting DOM.
page.onLoadFinished = function (status) {
    if (status !== "success") {
        console.log("Failed to load " + url);
        phantom.exit(1);
        return;
    }
    var html = page.evaluate(function () {
        return document.body.innerHTML;
    });
    printArgs(html);
    phantom.exit(0);
};

if (url) {
    page.open(url);
} else {
    // Without a URL there is nothing to open, so exit instead of hanging.
    console.log("Usage: phantomjs scrape.js <url>");
    phantom.exit(1);
}
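
One caveat with dumping the DOM as soon as loading finishes: content that the page's scripts inject later (for example after an Ajax call) may not be there yet. A possible workaround, sketched below, is to poll the page until some element appears and only then print page.content. The helper waitFor, the selector "#results", and the 10 s / 250 ms timings are assumptions made for this example, not part of the answer above.

```javascript
// Usage (hypothetical file name): phantomjs wait.js http://your.url.to.scrape.com
"use strict";

// Calls check() repeatedly, every intervalMs, until it returns true or
// timeoutMs has elapsed; then calls done(true) or done(false) once.
// `schedule` is optional and defaults to a setTimeout wrapper.
function waitFor(check, done, timeoutMs, intervalMs, schedule) {
    var start = Date.now();
    schedule = schedule || function (fn, ms) { setTimeout(fn, ms); };
    (function poll() {
        if (check()) {
            done(true);
        } else if (Date.now() - start >= timeoutMs) {
            done(false);
        } else {
            schedule(poll, intervalMs);
        }
    })();
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== "undefined") {
    var system = require("system");
    var page = require("webpage").create();

    page.open(system.args[1], function (status) {
        if (status !== "success") {
            console.log("Failed to load " + system.args[1]);
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // "#results" is a hypothetical selector for the late-arriving content.
            return page.evaluate(function () {
                return document.querySelector("#results") !== null;
            });
        }, function (ready) {
            // Dump whatever markup exists; the exit code tells the two cases apart.
            console.log(page.content);
            phantom.exit(ready ? 0 : 1);
        }, 10000, 250);
    });
}
```

Polling for a specific element is usually more reliable than a fixed delay, but it requires knowing in advance which element signals that the page is ready.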