How do I parse an HTML page with Node.js

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7372972/

Date: 2020-09-02 14:34:03  Source: igfitidea

How do I parse an HTML page with Node.js

Tags: node.js, html-parsing, server-side

Asked by Itay Moav -Malimovka

I need to parse (server side) large numbers of HTML pages.
We all agree that regexp is not the way to go here.
It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to this problem, parsing HTML on the server side?

Accepted answer by kzh

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
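
For a quick sense of the jsdom route, here is a minimal sketch using the current jsdom API (the JSDOM constructor; older releases exposed a different entry point), with made-up sample markup:

```js
const { JSDOM } = require("jsdom");

// Build a full DOM from an HTML string (it could equally come from a file or an HTTP response)
const dom = new JSDOM('<!DOCTYPE html><ul><li class="item">one</li><li class="item">two</li></ul>');
const document = dom.window.document;

// Standard W3C DOM accessors, exactly as in the browser
for (const li of document.querySelectorAll("li.item")) {
  console.log(li.textContent);
}
```

Because these are the standard DOM methods, the same traversal code can run unchanged in the browser.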

Other options include:

  • BeautifulSoup for Python
  • you can convert your HTML to XHTML and use XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer using the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.

Answered by Meekohi

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.
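
As a concrete illustration, here is a minimal Cheerio sketch (the markup and selectors are invented for the example; cheerio.load is the library's documented entry point):

```js
const cheerio = require("cheerio");

// Load an HTML string into a jQuery-like wrapper
const $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li></ul>');

// Familiar jQuery-style selectors and traversal
console.log($("#fruits .apple").text()); // "Apple"
$("li").each((i, el) => {
  console.log($(el).text());
});
```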

Answered by Anderson Madeira

Use htmlparser2; it's much faster and pretty straightforward. See the usage example here:

https://www.npmjs.org/package/htmlparser2#usage
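
For reference, a minimal streaming sketch along the lines of that usage example (handler names as in the htmlparser2 documentation; the HTML string is only an illustration):

```js
const { Parser } = require("htmlparser2");

const parser = new Parser({
  onopentag(name, attributes) {
    // Called for every opening tag as it is parsed
    if (name === "a") console.log("link:", attributes.href);
  },
  ontext(text) {
    // Text nodes may arrive in several chunks
    console.log("text:", text);
  },
});

parser.write('<p>See <a href="https://example.com">this example</a> for details</p>');
parser.end();
```

Because it is an event-driven streaming parser, it never has to build a full DOM, which is where much of the speed comes from.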

And there is a live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

Answered by esp

htmlparser2 by FB55 seems to be a good alternative.

Answered by Yarek T

jsdom is too strict to do any real screen-scraping sort of work, but BeautifulSoup doesn't choke on bad markup.

node-soupselect is a port of Python's BeautifulSoup into Node.js, and it works beautifully.
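
A rough sketch of how that combination was typically wired up, going by the soupselect README of that era; treat the exact names (DefaultHandler, parseComplete, select) as assumptions rather than a verified, current API:

```js
// Sketch only: these names are recalled from the soupselect/htmlparser READMEs
// of that era and should be double-checked against the packages.
const select = require("soupselect").select;
const htmlparser = require("htmlparser"); // the older, forgiving parser soupselect pairs with

// Parse the markup into a DOM tree, then run CSS-style selections over it
const handler = new htmlparser.DefaultHandler((err, dom) => {
  if (err) {
    console.error("parse error:", err);
  } else {
    // BeautifulSoup-style selection by tag and class
    for (const el of select(dom, "a.title")) {
      console.log(el.attribs.href);
    }
  }
});

new htmlparser.Parser(handler).parseComplete('<a class="title" href="/post/1">First post</a>');
```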
