Java: parsing HTML with a SAX parser

Question

Asked by user972590

I am trying to parse an ordinary HTML file using a SAX parser.

SAXBuilder builder2 = new SAXBuilder();
try {
    Document sdoc = (Document) builder2.build(readFile);
    NodeList nl = sdoc.getElementsByTagName("body");
    System.out.println("nodelist>>>>>>>>>>>" + nl.getLength());
} catch (JDOMException e1) {
    e1.printStackTrace();
}

But I am getting this exception:

Open quote is expected for attribute "{1}" associated with an  element type  "class".

Can anyone please tell me why I am getting this exception? The HTML document is well formed, and all of its opening and closing tags are properly matched.

Thanks in advance.

Answer 1

Answered by Tom Anderson

As flash says, you need an HTML parser, not an XML parser. HTML is not XML.

Good parsers I've used are Neko and TagSoup. Neko is a good all-round parser; TagSoup specifically aims to be able to parse anything, no matter how ill-formed.

Answer 2

Answered by Stephen C

Generally speaking, you cannot parse HTML with an XML parser:

HTML's element tags are not required to match in all cases. (For example, a <p> tag does not require a matching </p> tag.) An XML parser treats such an unmatched tag as a fatal well-formedness error and stops.
Real-world HTML is notorious for not conforming to the HTML spec, let alone to an XML-compatible subset of HTML.
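To make the two points above concrete, here is a minimal sketch that feeds small markup fragments to the JDK's built-in SAX parser. The class name `SaxStrictnessDemo`, the helper `parseError`, and the sample fragments are made up for illustration; the unquoted `class=main` attribute is only a guess at the kind of markup that could trigger the exception in the question, since the actual file is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows which fragments an XML (SAX) parser accepts.
class SaxStrictnessDemo {

    // Returns null if the markup parses as XML, otherwise the parser's error message.
    static String parseError(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            // DefaultHandler rethrows fatal errors, so ill-formed input ends up in the catch block.
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] samples = {
            "<body><div class=\"main\">hi</div></body>", // well-formed XML
            "<body><div class=main>hi</div></body>",     // legal HTML, but the attribute value is unquoted
            "<body><p>one<p>two</body>"                  // legal HTML, but the <p> tags are never closed
        };
        for (String sample : samples) {
            String error = parseError(sample);
            System.out.println(sample + " -> " + (error == null ? "parsed" : "rejected: " + error));
        }
    }
}
```

Only the first fragment should parse. The other two are acceptable to a browser but not to an XML parser, which is why an HTML file that looks fine can still fail here.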

However, if your input document is XHTML, you should in theory be able to use an XML parser such as a SAX parser. You should even be able to validate the document against the XHTML schema.
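As a rough illustration of the XHTML case, the sketch below counts elements in a small well-formed XHTML string using the JDK's SAX parser. The class name `XhtmlSaxDemo`, the helper `countElements`, and the sample document are assumptions made for this example. The sample deliberately omits a DOCTYPE, since a parser may try to fetch the referenced DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: parses well-formed XHTML with a plain SAX parser.
class XhtmlSaxDemo {

    // Counts the start tags whose qualified name equals the given element name.
    static int countElements(String xhtml, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Sample</title></head>"
            + "<body><p>one</p><p>two</p><br/></body>"
            + "</html>";
        System.out.println("body elements: " + countElements(xhtml, "body"));
        System.out.println("p elements: " + countElements(xhtml, "p"));
    }
}
```

This works only because every tag is closed and every attribute value is quoted; the same handler would never be reached for the tag-soup input described earlier.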

Answer 3

Answered by flash

Please have a look at HtmlParser. SAX is normally not a good choice for parsing HTML.
