
Notice: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/25753368/


Performant parsing of pages with Node.js and XPath

Tags: javascript, html, node.js, xpath, phantomjs

Asked by polkovnikov.ph

I'm into some web scraping with Node.js. I'd like to use XPath as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.


  1. jsdom is extremely slow. It parses a 500 KiB file in a minute or so, with full CPU load and a heavy memory footprint.
  2. Popular libraries for HTML parsing (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
  3. Effective HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those need to be run in a special way, not just node <script>. I cannot rely on the risk implied by this change. For example, it's much more difficult to find out how to run node-inspector with phantom.
  4. Spooky is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath then?


Answered by pda

You can do so in several steps.


  1. Parse the HTML with parse5. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
  2. Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
  3. Parse that XHTML again with xmldom. Now you finally have that DOM.
  4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, and queries like //a won't work.

Finally you get something like this.


const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    // parse5 tolerantly parses real-world HTML, but returns its own tree, not a W3C DOM
    const document = parse5.parse(html.toString());
    // serialize the parse5 tree to well-formed XHTML
    const xhtml = xmlser.serializeToString(document);
    // re-parse the XHTML with xmldom to finally obtain a W3C-compliant DOM
    const doc = new dom().parseFromString(xhtml);
    // XHTML elements live in a namespace, so bind a prefix for the XPath queries
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();

Answered by mb21

Libxmljs is currently the fastest implementation (something like a benchmark), since it consists only of bindings to the LibXML C library, which supports XPath 1.0 queries:


var libxmljs = require("libxmljs");

// example input; in practice this would be your sanitized XML/XHTML
var xml = '<child><grandchild>value</grandchild></child>';
var xmlDoc = libxmljs.parseXml(xml);

// xpath queries; get() returns the first matching node
var gchild = xmlDoc.get('//grandchild');

However, you need to sanitize your HTML first and convert it to proper XML. For that you could either use the HTMLTidy command line utility (tidy -q -asxml input.html), or if you want to keep it Node-only, something like xmlserializer should do the trick.

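For the xmlserializer route, a minimal sketch could look like the following (the sample HTML string and the x namespace prefix are my own assumptions, not part of the original answer): parse5 tolerates real-world HTML, xmlserializer turns its tree into well-formed XHTML, and libxmljs then parses that and runs the XPath query.

const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const libxmljs = require('libxmljs');

// made-up sample input for illustration
const html = '<ul><li><a href="/a">a</a></li><li><a href="/b">b</a></li></ul>';
const xhtml = xmlser.serializeToString(parse5.parse(html));
const doc = libxmljs.parseXml(xhtml);

// the serialized document is XHTML, so bind its namespace for the query
const hrefs = doc.find('//x:a/@href', { x: 'http://www.w3.org/1999/xhtml' });
hrefs.forEach((attr) => console.log(attr.value()));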

Answered by Soren

I have just started using npm install htmlstrip-native, which uses a native implementation to parse and extract the relevant HTML parts. It claims to be 50 times faster than the pure JS implementation (I have not verified that claim).


Depending on your needs you can use html-strip directly, or lift the code and bindings to make your own use of the C++ that htmlstrip-native uses internally.


If you want to use XPath, then use the wrapper already available here: https://www.npmjs.org/package/xpath


Answered by rchipka

I think Osmosis is what you're looking for.


  • Uses native libxml C bindings
  • Supports CSS 3.0 and XPath 1.0 selector hybrids
  • Sizzle selectors, Slick selectors, and more
  • No large dependencies like jQuery, cheerio, or jsdom
  • HTML parser features

    • Fast parsing
    • Very fast searching
    • Small memory footprint
  • HTML DOM features

    • Load and search ajax content
    • DOM interaction and events
    • Execute embedded and remote scripts
    • Execute code in the DOM

Here's an example:


const osmosis = require('osmosis');
let count = 0;

osmosis.get(url) // url: the page to scrape (left abstract in the original answer)
    .find('//div[@class]/ul[2]/li')
    .then(function () {
        count++;
    })
    .done(function () {
        // assert comes from the test harness this snippet was lifted from
        assert.ok(count == 2);
        assert.done();
    });

Answered by Hieu Van

With just one line, you can do it with xpath-html:


const xpath = require("xpath-html");

// html holds the page source you have already fetched
const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");

Answered by pateheo

There might never be a single right way to parse HTML pages. A first review of web scraping and crawling suggests that Scrapy can be a good candidate for your needs. It accepts both CSS and XPath selectors. In the realm of Node.js, we have a pretty new module, node-osmosis. This module is built upon libxmljs, so it is supposed to support both CSS and XPath, although I did not find any examples using XPath.

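That said, since osmosis's find() and set() accept either selector style, a rough sketch of mixing them might look like this (the URL, selectors, and data keys below are illustrative placeholders, not from the original answer):

const osmosis = require('osmosis');

osmosis
    .get('https://example.com')       // placeholder URL
    .find('//div[@id="content"]')     // XPath selects the container...
    .set({
        title: 'h1',                  // ...while CSS selectors fill the fields
        links: ['a@href']             // array form collects all matches
    })
    .data((item) => console.log(item));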