javascript: Use PhantomJS to extract html and text

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18453993/

Time: 2020-10-27 11:59:29  Source: igfitidea

Use PhantomJS to extract html and text

javascript html parsing dom phantomjs

Asked by Jay Romuald

I am trying to extract all the text content of a page (because it doesn't work with Simpledomparser).

I tried modifying this simple example from the manual:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').textContent;
        });
        console.log(ua);
    }
    phantom.exit();
});

I tried changing

return document.getElementById('myagent').textContent;

to

return document.textContent;

This doesn't work.

What's the right way to do this simple thing?

Answer by justageek

This version of your script should return the entire contents of the page as HTML:

var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].outerHTML;
        });
        console.log(ua);
    }
    phantom.exit();
});

Answer by Artjom B.

There are multiple ways to retrieve the page content as a string:

  • page.content gives the complete source including the markup (<html>) and doctype (<!DOCTYPE html>),

  • document.documentElement.outerHTML (via page.evaluate) gives the complete source including the markup (<html>), but without doctype,

  • document.documentElement.textContent (via page.evaluate) gives the cumulative text content of the complete document including inline CSS & JavaScript, but without markup,

  • document.documentElement.innerText (via page.evaluate) gives the cumulative text content of the complete document excluding inline CSS & JavaScript and without markup.

document.documentElement can be replaced by an element or query of your choice.
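As a rough sketch of the options listed above, the script below prints a short preview of each one. The helper names (getOuterHtml, getTextContent, getInnerText, preview) and the 80-character preview length are made up for illustration; the URL is the one from the question. The typeof phantom guard is only there so the helpers can also be loaded outside PhantomJS.

```javascript
// Hypothetical helpers: each one is passed to page.evaluate(), so it runs
// inside the loaded page and must return a plain value (here, a string).
function getOuterHtml() {
    return document.documentElement.outerHTML;
}

function getTextContent() {
    // Note: document.textContent (on the document node itself) is null,
    // which is why the attempt in the question returns nothing useful.
    return document.documentElement.textContent;
}

function getInnerText() {
    return document.documentElement.innerText;
}

// Hypothetical helper: shorten a value for logging.
function preview(value, maxLength) {
    return String(value).slice(0, maxLength || 80);
}

// Only open a page when run by PhantomJS (the `phantom` global exists).
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            console.log('page.content: ' + preview(page.content));
            console.log('outerHTML:    ' + preview(page.evaluate(getOuterHtml)));
            console.log('textContent:  ' + preview(page.evaluate(getTextContent)));
            console.log('innerText:    ' + preview(page.evaluate(getInnerText)));
        }
        phantom.exit();
    });
}
```

Saved under a name of your choice (for example extract.js, an arbitrary name) it would be run as phantomjs extract.js.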

Answer by Cybermaxs

To extract the text content of the page, you can try this: return document.body.textContent; but I'm not sure the result will be usable.
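One reason the raw result may be hard to use is that textContent keeps all the whitespace from the markup. A possible cleanup, offered only as a sketch, is to collapse whitespace runs after the text comes back from page.evaluate. The collapseWhitespace helper is hypothetical and not part of PhantomJS; the URL is the one from the question.

```javascript
// Hypothetical helper: turn every run of whitespace into a single space
// and strip leading/trailing whitespace.
function collapseWhitespace(text) {
    return String(text).replace(/\s+/g, ' ').replace(/^\s+|\s+$/g, '');
}

// Only open a page when run by PhantomJS (the `phantom` global exists).
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var raw = page.evaluate(function () {
                return document.body.textContent;
            });
            console.log(collapseWhitespace(raw));
        }
        phantom.exit();
    });
}
```

This does not remove the text of inline scripts or styles inside the body; it only tidies the whitespace.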

Answer by evolutionise

Having encountered this question while trying to solve a similar problem, I ended up adapting a solution from this question like so:

var fs = require('fs');
var file_h = fs.open('header.html', 'r');
var header = "";

// Append every line until the end of the file is reached.
// (Reading one line before the loop would silently drop the first line.)
while (!file_h.atEnd()) {
    header += file_h.readLine();
}
console.log(header);

file_h.close();
phantom.exit();

This gave me a string containing the read-in HTML file, which was sufficient for my purposes and will hopefully help others who come across this.

The question seemed ambiguous (was the entire content of the file required, or just the "text", i.e. the strings?), so this is one possible solution.
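If the goal is just to get a local HTML file into one string, a shorter variant is to read it in a single call with the read function of PhantomJS's fs module. This is a minimal sketch: readHtmlFile is a hypothetical wrapper, and header.html is the file name from the code above.

```javascript
// Hypothetical wrapper. Assumption: `fsModule` is PhantomJS's built-in 'fs'
// module, whose read(path) returns the whole file content as a string.
function readHtmlFile(fsModule, path) {
    return fsModule.read(path);
}

// Only touch the file system when run by PhantomJS.
if (typeof phantom !== 'undefined') {
    var header = readHtmlFile(require('fs'), 'header.html');
    console.log(header);
    phantom.exit();
}
```

Passing the module in as a parameter is just a convenience so the wrapper can be tried with a stand-in object outside PhantomJS.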