Original question: http://stackoverflow.com/questions/7817495/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Parsing html with SAX parser
Asked by user972590
I am trying to parse a normal HTML file using a SAX parser.
SAXBuilder builder2 = new SAXBuilder();
try {
    Document sdoc = (Document) builder2.build(readFile);
    NodeList nl = sdoc.getElementsByTagName("body");
    System.out.println("nodelist>>>>>>>>>>>" + nl.getLength());
} catch (JDOMException e1) {
    e1.printStackTrace();
}
But I am getting this exception:
Open quote is expected for attribute "{1}" associated with an element type "class".
Can anyone please tell me why I am getting this exception? The HTML document is well formed and all of its opening and closing tags match.
Thanks in advance.
Answered by Tom Anderson
As flash says, you need an HTML parser, not an XML parser. HTML is not XML.
Good parsers I've used are Neko and TagSoup. Neko is a good all-round parser; TagSoup specifically aims to be able to parse anything, no matter how ill-formed.
Answered by Stephen C
Generally speaking, you cannot parse HTML with an XML parser:
HTML's element tags are not required to match in all cases. (For example, a <p> tag does not require a matching </p> tag.) This will cause terminal indigestion for an XML parser.
Real-world HTML is notorious for not being conformant to the HTML spec, let alone an XML-compatible subset of HTML.
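To make the two points above concrete, here is a minimal sketch that feeds small markup snippets to the JDK's built-in SAX parser and reports whether each one is accepted. The class name XmlParserOnHtmlDemo, the helper tryParse, and the sample snippets are made up for illustration; they are not from the original question. An unquoted attribute value such as class=note is legal in HTML but not in XML, and it is the kind of input that typically produces an "Open quote is expected" error like the one in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlParserOnHtmlDemo {

    // Hypothetical helper: returns null when the markup parses as XML,
    // otherwise the error message reported by the XML parser.
    public static String tryParse(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            // Well-formedness violations surface here as fatal errors.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            // Well-formed: every tag closed, attribute quoted.
            "<html><body><p class=\"note\">one</p></body></html>",
            // Valid HTML, but the <p> tags are never closed.
            "<html><body><p>one<p>two</body></html>",
            // Valid HTML, but the attribute value is unquoted.
            "<html><body><div class=note>x</div></body></html>"
        };
        for (String sample : samples) {
            String error = tryParse(sample);
            System.out.println(sample);
            System.out.println("  -> " + (error == null ? "parsed OK" : "rejected: " + error));
        }
    }
}
```

Only the first sample is accepted. The other two are ordinary HTML that a browser would render without complaint, which is why a document can look "well formed" to its author and still be rejected by an XML parser.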
However, if your input document is XHTML, you should in theory be able to use an XML parser such as SAX. You should even be able to validate the document against the XHTML schema.
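As a rough illustration of the XHTML case, the sketch below counts elements by name using the JDK's SAX parser. The class name XhtmlSaxDemo, the method countElements, and the sample document are assumptions made for this example. The sample deliberately omits a DOCTYPE declaration so that the parser does not try to fetch a DTD, and validation against the XHTML schema is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlSaxDemo {

    // Hypothetical helper: counts elements whose local name equals the given name.
    public static int countElements(String xhtml, String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (name.equals(localName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness is needed for localName to be populated.
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
          + "<head><title>demo</title></head>"
          + "<body><p class=\"a\">one</p><p>two</p></body>"
          + "</html>";
        System.out.println("body elements: " + countElements(xhtml, "body"));
        System.out.println("p elements: " + countElements(xhtml, "p"));
    }
}
```

This works only because the input is strict XML: every tag is closed and every attribute value is quoted. Feeding it typical real-world HTML would fail in the way described above.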
Answered by flash
Please have a look at HtmlParser. Normally SAX is not a good solution for parsing HTML.