javascript: Use PhantomJS to extract HTML and text
Original question: http://stackoverflow.com/questions/18453993/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Jay Romuald
I am trying to extract all the text content of a page (because it doesn't work with Simpledomparser).
I tried modifying this simple example from the manual:
var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').textContent;
        });
        console.log(ua);
    }
    phantom.exit();
});
I tried changing
    return document.getElementById('myagent').textContent;
to
    return document.textContent;
but this doesn't work.
What is the right way to do this simple thing?
Answered by justageek
This version of your script should return the entire contents of the page:
var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].outerHTML;
        });
        console.log(ua);
    }
    phantom.exit();
});
Answered by Artjom B.
There are multiple ways to retrieve the page content as a string:
page.content gives the complete source including the markup (<html>) and doctype (<!DOCTYPE html>),
document.documentElement.outerHTML (via page.evaluate) gives the complete source including the markup (<html>), but without doctype,
document.documentElement.textContent (via page.evaluate) gives the cumulative text content of the complete document including inline CSS & JavaScript, but without markup,
document.documentElement.innerText (via page.evaluate) gives the cumulative text content of the complete document excluding inline CSS & JavaScript and without markup.
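As a rough sketch, the four options listed above could be compared side by side in one PhantomJS script. The URL is the one from the question; the helper names getOuterHTML, getTextContent and getInnerText are made up for illustration, and the typeof phantom guard is only an assumption added so the file does nothing when loaded outside PhantomJS.

```javascript
// Helpers intended to be passed to page.evaluate. They run inside the page,
// so they only use the page's own `document`. The names are illustrative.
function getOuterHTML() {
    return document.documentElement.outerHTML;
}

function getTextContent() {
    return document.documentElement.textContent;
}

function getInnerText() {
    return document.documentElement.innerText;
}

// Only drive a real page when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // Print the length of each variant so they can be compared.
            console.log('page.content: ' + page.content.length);
            console.log('outerHTML:    ' + page.evaluate(getOuterHTML).length);
            console.log('textContent:  ' + page.evaluate(getTextContent).length);
            console.log('innerText:    ' + page.evaluate(getInnerText).length);
        }
        phantom.exit();
    });
}
```

Printing lengths rather than the strings themselves keeps the output short; swap in the strings once you know which variant you need.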
document.documentElement can be replaced with any element or query of your choice.
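To illustrate swapping document.documentElement for a query, here is a minimal sketch that passes a CSS selector into page.evaluate. The function name textOfFirstMatch is hypothetical, the '#myagent' selector is derived from the element id used in the question, and the typeof phantom guard is again only an assumption so the file is inert outside PhantomJS.

```javascript
// Runs inside the page: returns the text of the first element matching
// `selector`, or null when nothing matches. The name is illustrative.
function textOfFirstMatch(selector) {
    var el = document.querySelector(selector);
    return el ? el.textContent : null;
}

// Only drive a real page when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // Extra arguments to page.evaluate are handed to the function.
            console.log(page.evaluate(textOfFirstMatch, '#myagent'));
        }
        phantom.exit();
    });
}
```

Returning null for a missing element avoids a TypeError inside the page, which would otherwise be easy to miss because it happens in the page context.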
Answered by Cybermaxs
To extract the text content of the page, you can try this:
    return document.body.textContent;
but I'm not sure the result will be usable.
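One likely reason the raw result is hard to use is that textContent keeps all the whitespace from the markup. The sketch below collapses it after extraction, assuming that losing line breaks is acceptable for your purposes. The function name collapseWhitespace is hypothetical, and the typeof phantom guard is only there so the file does nothing outside PhantomJS.

```javascript
// Collapse every run of whitespace to one space and strip the ends.
// The name is illustrative, not part of PhantomJS.
function collapseWhitespace(text) {
    return text.replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

// Only drive a real page when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var raw = page.evaluate(function () {
                return document.body.textContent;
            });
            // The cleanup runs in the PhantomJS context, not in the page.
            console.log(collapseWhitespace(raw));
        }
        phantom.exit();
    });
}
```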
Answered by evolutionise
Having encountered this question while trying to solve a similar problem, I ended up adapting a solution from this question, like so:
var fs = require('fs');
// Open the local HTML file for reading.
var file_h = fs.open('header.html', 'r');
var header = "";
// Append every line, including the first, until the end of the file.
while (!file_h.atEnd()) {
    header += file_h.readLine();
}
console.log(header);
file_h.close();
phantom.exit();
This gave me a string containing the read-in HTML file, which was sufficient for my purposes and will hopefully help others who come across this.
The question seemed ambiguous (was the entire content of the file required, or just the "text", i.e. the strings?), so this is one possible solution.