Performant parsing of pages with Node.js and XPath

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) at StackOverflow.

Original URL: http://stackoverflow.com/questions/25753368/
Asked by polkovnikov.ph
I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.
- `jsdom` is extremely slow. It parses a 500 KiB file in a minute or so, with full CPU load and a heavy memory footprint.
- Popular libraries for HTML parsing (e.g. `cheerio`) neither support XPath nor expose a W3C-compliant DOM.
- Effective HTML parsing is, obviously, implemented in WebKit, so using `phantom` or `casper` would be an option, but those need to be run in a special way, not just `node <script>`. I cannot rely on the risk implied by this change. For example, it's much more difficult to find out how to run `node-inspector` with `phantom`.
- `Spooky` is an option, but it's buggy enough that it didn't run at all on my machine.
What's the right way to parse an HTML page with XPath then?
Answered by pda
You can do so in several steps.
- Parse the HTML with `parse5`. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
- Serialize it to XHTML with `xmlserializer`, which accepts the DOM-like structures of `parse5` as input.
- Parse that XHTML again with `xmldom`. Now you finally have that DOM.
- The `xpath` library builds upon `xmldom`, allowing you to run XPath queries. Be aware that XHTML has its own namespace, and queries like `//a` won't work.
Finally you get something like this.
```javascript
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    // 1. Parse the HTML with parse5 (fast, but not a DOM)
    const html = await fs.readFile('./test.htm');
    const document = parse5.parse(html.toString());
    // 2. Serialize the parse5 tree to XHTML
    const xhtml = xmlser.serializeToString(document);
    // 3. Re-parse the XHTML with xmldom to get a real DOM
    const doc = new dom().parseFromString(xhtml);
    // 4. Bind the XHTML namespace to a prefix and run XPath queries
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();
```
Answered by mb21
Libxmljs is currently the fastest implementation (something like a benchmark), since it's only bindings to the LibXML C library, which supports XPath 1.0 queries:
```javascript
var libxmljs = require("libxmljs");

var xml = '<root><child><grandchild/></child></root>'; // your sanitized XML string

var xmlDoc = libxmljs.parseXml(xml);

// xpath queries
var gchild = xmlDoc.get('//grandchild');
```
However, you need to sanitize your HTML first and convert it to proper XML. For that you could either use the HTMLTidy command-line utility (`tidy -q -asxml input.html`), or if you want to keep it Node-only, something like xmlserializer should do the trick.
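For the Node-only route, a minimal sketch of that sanitization step might look like the following (assuming the same `parse5` and `xmlserializer` packages used in the accepted answer; the sample HTML is made up):

```javascript
// Sketch: turn tag-soup HTML into XML that libxmljs can parse.
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const libxmljs = require('libxmljs');

const html = '<ul><li>one<li>two</ul>';  // unclosed tags, typical real-world HTML
const xhtml = xmlser.serializeToString(parse5.parse(html));
const xmlDoc = libxmljs.parseXml(xhtml);

// Serialized XHTML is namespaced, so bind a prefix for the query.
const items = xmlDoc.find('//x:li', { x: 'http://www.w3.org/1999/xhtml' });
console.log(items.length); // 2
```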
Answered by Soren
I have just started using `npm install htmlstrip-native`, which uses a native implementation to parse and extract the relevant HTML parts. It claims to be 50 times faster than the pure JS implementation (I have not verified that claim).
Depending on your needs you can use html-strip directly, or lift the code and bindings to make your own use of the C++ used internally in htmlstrip-native.
If you want to use XPath, then use the wrapper already available here: https://www.npmjs.org/package/xpath
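For reference, basic usage of that `xpath` wrapper (paired with `xmldom`, as in the accepted answer) is just a couple of lines; the sample document here is made up:

```javascript
const xpath = require('xpath');
const { DOMParser } = require('xmldom');

// Parse a small XML document and evaluate an XPath expression against it.
const doc = new DOMParser().parseFromString('<book><title>Harry Potter</title></book>');
const title = xpath.select('string(//title)', doc); // "Harry Potter"
console.log(title);
```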
Answered by rchipka
I think Osmosis is what you're looking for.
- Uses native libxml C bindings
- Supports CSS 3.0 and XPath 1.0 selector hybrids
- Sizzle selectors, Slick selectors, and more
- No large dependencies like jQuery, cheerio, or jsdom
HTML parser features
- Fast parsing
- Very fast searching
- Small memory footprint
HTML DOM features
- Load and search ajax content
- DOM interaction and events
- Execute embedded and remote scripts
- Execute code in the DOM
```javascript
const osmosis = require('osmosis');

let count = 0;
osmosis.get(url) // url is assumed to be defined elsewhere
    .find('//div[@class]/ul[2]/li')
    .then(function () {
        count++; // runs once per matched node
    })
    .done(function () {
        // nodeunit-style assertions, as in the library's test suite
        assert.ok(count == 2);
        assert.done();
    });
```
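Since osmosis advertises CSS/XPath selector hybrids, the same query can presumably mix both syntaxes in one expression. A hypothetical sketch (the URL and the `item` property name are made up, and the hybrid selector is assumed from the feature list rather than taken from the docs):

```javascript
const osmosis = require('osmosis');

osmosis.get('http://example.com')
    // XPath predicates combined with a CSS child combinator (assumed hybrid form)
    .find('//div[@class]/ul[2] > li')
    .set('item') // store each match's text under a made-up property name
    .data(function (result) {
        console.log(result.item);
    });
```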
Answered by Hieu Van
With just one line, you can do it with `xpath-html`:
const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
Answered by pateheo
There might never be a right way to parse HTML pages. A very first review of web scraping and crawling shows me that Scrapy can be a good candidate for your need. It accepts both CSS and XPath selectors. In the realm of Node.js, we have a pretty new module, node-osmosis. This module is built upon libxmljs, so it is supposed to support both CSS and XPath, although I did not find any example using XPath.