Parsing HTML data in Java, including &lt; and &gt; tags?

Question

Asked by Deepu

I want to parse HTML text in Java.

I have tried to parse HTML data using javax.swing.text.html.HTMLEditorKit. It helped me get data out of HTML. But I have HTML data like this:

&lt;span class="TitleServiceChange" &gt;Service Change&lt;/span&gt;
                    &lt;span class="DateStyle"&gt;
                    &amp;nbsp;Posted:&amp;nbsp;12/16/2012&amp;nbsp; 8:00PM
                    &lt;/span&gt;&lt;br/&gt;&lt;br/&gt;
                  &lt;P&gt;

with '&lt;' and '&gt;' surrounding the tag names instead of '<' and '>'.

While parsing the above text I get this error:

Parsing error: start.missing body ? ? at

Please suggest how I can resolve this problem. Thanks in advance.

Answer 1

Answered by Tomas Narros

To unescape the full set of escaped characters in a string, you could use the Apache Commons Lang utility library.

Specifically, use the StringEscapeUtils class, which provides the unescapeHtml4 method, among others.

Answer 2

Answered by Juvanis

If you can get the String representation of the data, replacing the escaped sequences with the actual tag characters could resolve your problem:

String htmlData = ...

htmlData = htmlData.replaceAll("&lt;", "<");
htmlData = htmlData.replaceAll("&gt;", ">");
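A slightly more defensive variant of the replacement idea above, as a minimal sketch: it uses String.replace (literal matching, no regular expressions) and decodes &amp; last, so that a doubly escaped sequence such as &amp;lt; is only unescaped one level. The class and method names (BasicUnescape, unescapeBasic) are made up for illustration, and only four entities are covered.

```java
final class BasicUnescape {

    // Hypothetical helper, not part of any library: reverses one level of
    // escaping for the four entities handled here.
    static String unescapeBasic(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                // Decode &amp; last, so "&amp;lt;" becomes "&lt;" and not "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String htmlData =
                "&lt;span class=\"DateStyle\"&gt;&amp;nbsp;Posted:&amp;nbsp;12/16/2012&lt;/span&gt;";
        // Prints: <span class="DateStyle">&nbsp;Posted:&nbsp;12/16/2012</span>
        System.out.println(unescapeBasic(htmlData));
    }
}
```

The result still contains &nbsp; because the sample data was escaped on top of existing entities; an HTML parser can then handle that remaining reference itself.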

Answer 3

Answered by Raffaele

HTML can be described in XML terms. XML has the concept of character data, which is made up of characters. Five characters have special meaning in XML: >, <, &, " and '. They are used to define elements and delimit attributes, so the parser doesn't treat them like normal characters. When you need to insert a literal < in an XML document (like I just did in this answer), you can use a character reference of the form &lt;, so that the browser understands that you do not intend to start an XML tag. The HTML4 DTD defines 252 named entities, so it's infeasible to call replaceAll() for each of them to get a readable string.
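To illustrate why a lookup table scales better than one replaceAll() call per entity, here is a rough sketch of a single-pass decoder built only on java.util.regex. The class name EntityDecoder and its tiny NAMED table are assumptions for illustration: the table holds six entries rather than the full HTML4 set, and unknown references are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecoder {

    // Matches &name;, decimal &#60; and hexadecimal &#x3C; references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Illustrative subset only; a real table would list every named entity.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Decodes each reference exactly once, scanning left to right.
    static String decode(String input) {
        Matcher m = REFERENCE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown reference: keep as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }
}
```

Because every match is replaced once and the output is never rescanned, the order of table entries does not matter, unlike with chained replace calls. For production use, a maintained library with the complete entity table is the safer choice.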

You should understand how HTML works, so that you can think like a web browser when you design how your data is stored and rendered. Note that:

&lt;tag&gt;

has a very different meaning than

<tag>

So you should elaborate on your question to get help in the right direction.
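The difference between the escaped and the unescaped form can be observed with the JDK's own Swing HTML parser, the one behind HTMLEditorKit. The following is only a sketch: the class names EscapedVsMarkup and Collector are made up for illustration, and the callback records tag names (including the html, head and body tags the parser implies) and text without doing anything else.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class EscapedVsMarkup {

    // Records every tag name and all character data the parser reports.
    static final class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> tags = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            tags.add(t.toString());
        }

        @Override
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            tags.add(t.toString());
        }

        @Override
        public void handleText(char[] data, int pos) {
            text.append(data);
        }
    }

    static Collector parse(String html) {
        Collector collector = new Collector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector escaped = parse("&lt;span&gt;Service Change&lt;/span&gt;");
        Collector markup = parse("<span>Service Change</span>");
        System.out.println("escaped tags: " + escaped.tags + ", text: " + escaped.text);
        System.out.println("markup tags:  " + markup.tags + ", text: " + markup.text);
    }
}
```

With the escaped input, the span shows up only as part of the text; with real markup it is reported as a tag. That is why the data has to be unescaped once before it is handed to an HTML parser.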
