Java Unicode problem with HTML title: question mark, &#65533;

Question

Asked by James

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my web page with UTF-8 encoding, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

Any ideas? Thanks.

Answer 1

Accepted answer by erickson

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
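
The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not the asker's actual code: the class name ReplacementCharDemo and the helper decodeLatin1BytesAsUtf8 are hypothetical, and the sample string is just a fragment of the title. It encodes the text as ISO-8859-1 and then decodes those bytes as UTF-8, which is the situation the answer suspects.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: encodes the text as ISO-8859-1, then decodes
    // the resulting bytes with the wrong decoder (UTF-8).
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // new String(bytes, charset) replaces malformed input with U+FFFD
        // instead of throwing an exception.
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; in ISO-8859-1 it is the single byte 0xE9,
        // which is not a valid UTF-8 sequence on its own.
        String wrong = decodeLatin1BytesAsUtf8("Caf\u00e9: Rock");
        // Print the code point at the position where "é" used to be.
        System.out.println((int) wrong.charAt(3));
        System.out.println(wrong.charAt(3) == '\uFFFD');
    }
}
```

If the character at index 3 comes back as U+FFFD, any later HTML escaping step can only see the replacement character, which is consistent with the &#65533; in the question.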

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
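
One way to follow this advice is to pass the charset explicitly when turning the fetched bytes into a String. The sketch below is an illustration under assumptions: the class PageDecoder and the helper readAll are hypothetical names, and a ByteArrayInputStream stands in for the stream you would get from the real HTTP connection, so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageDecoder {
    // Hypothetical helper: reads the whole stream and decodes it with the
    // charset the caller names, rather than a platform or library default.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a server sends for an ISO-8859-1 page.
        byte[] pageBytes = "<title>Das hermetische Caf\u00e9</title>"
                .getBytes(StandardCharsets.ISO_8859_1);
        String html = readAll(new ByteArrayInputStream(pageBytes),
                StandardCharsets.ISO_8859_1);
        // True when the "é" was decoded correctly.
        System.out.println(html.contains("Caf\u00e9"));
    }
}
```

In real use the InputStream would come from the HTTP response, and the charset would ideally be taken from the Content-Type header instead of being hard-coded. With "é" intact in the String, escapeHtml has a real character to work with and should be able to produce the Caf&eacute; form the question expected.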
