Original source: http://stackoverflow.com/questions/13914010/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Parse HTML data in Java including &lt; and &gt; tags?
Asked by Deepu
I want to parse HTML text in Java.
I have tried to parse HTML data using javax.swing.text.html.HTMLEditorKit. It helped me get data out of the HTML. But I have HTML data like this:
&lt;span class="TitleServiceChange" &gt;Service Change&lt;/span&gt;
&lt;span class="DateStyle"&gt;
&nbsp;Posted:&nbsp;12/16/2012&nbsp; 8:00PM
&lt;/span&gt;&lt;br/&gt;&lt;br/&gt;
&lt;P&gt;
with surrounding '&lt;' and '&gt;' instead of '<' and '>'.
While parsing the above text I get this error:
Parsing error: start.missing body ? ? at
Please suggest how I can resolve this problem. Thanks in advance.
Answered by Tomas Narros
To unescape the full set of escaped characters in a string, you could use the Apache Commons Lang utility library.
Specifically, use the StringEscapeUtils class, which provides the unescapeHtml4 method, among others.
Answered by Juvanis
If you can get the String representation of the data, replacing the escaped sequences with the actual tag characters could resolve your problem:
String htmlData = ...
htmlData = htmlData.replaceAll("&lt;", "<");
htmlData = htmlData.replaceAll("&gt;", ">");
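A minimal runnable sketch of the replacement approach above, assuming the data contains only the &lt;, &gt; and &amp; references. The class and method names (TagUnescaper, unescapeTags) are hypothetical and not part of the original answer. It uses String.replace instead of replaceAll because no regular expression is needed, and it decodes &amp; last so that a double-escaped sequence is only decoded once.

```java
// Hypothetical helper for illustration; not from the original answer.
class TagUnescaper {

    // Decodes only the three references handled here; all other text is left untouched.
    static String unescapeTags(String htmlData) {
        return htmlData
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
    }

    public static void main(String[] args) {
        String escaped = "&lt;span class=\"DateStyle\"&gt;Posted&lt;/span&gt;";
        System.out.println(unescapeTags(escaped));
    }
}
```

The unescaped string can then be handed to whichever HTML parser you were already using.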
Answered by Raffaele
HTML can be described in XML terms. XML has the concept of character data, which is made up of characters. Five characters have special meaning in XML: >, <, &, " and '. They are used to define elements and delimit attributes, so the parser doesn't treat them like normal characters. When you need to insert a literal < in an XML document (like I just did in this answer), you can use a character reference of the form &lt;, so that the browser understands that you do not intend to start an XML tag. The HTML4 DTD defines 252 named entities, so chaining replaceAll() calls to get a readable string is infeasible.
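To illustrate why one replaceAll() call per entity does not scale, here is a rough sketch of the alternative: a single pass over the text that resolves each character reference through a lookup table. Everything in it (the EntityDecoder class, its decode method, the NAMED table) is hypothetical, the table holds only a handful of names instead of the full entity set, and in practice a library decoder is likely the better choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one pass over the text, one table lookup per reference.
class EntityDecoder {

    // Only a few named references, for illustration; a complete table would hold many more.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches named (&name;), decimal (&#60;) and hexadecimal (&#x3C;) references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = REFERENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            // Unknown references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text for a reference body, or null if it cannot be resolved.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // value too large to be a code point
        }
    }
}
```

Because the text is scanned only once, output produced by decoding one reference is never decoded again, which chained replacements cannot guarantee unless the calls are ordered carefully.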
You should understand how HTML works, so that you can think like a web browser when you design how your data is stored and rendered. Note that:
&lt;tag&gt;
has a very different meaning than
<tag>
So you should elaborate on your question to get help in the right direction.