Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3526965/

Time: 2020-08-14 01:32:43  Source: igfitidea

Unicode issue with an HTML Title, question mark? &#65533;

Tags: java, html, unicode, utf-8

Asked by James

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the Apache Commons Lang StringEscapeUtils.escapeHtml method on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my webpage with UTF-8 encoding, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

any ideas? thanks

Accepted answer by erickson

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
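
To make the mismatch concrete, here is a minimal sketch that takes ISO-8859-1 bytes and decodes them as UTF-8. The class and method names (ReplacementCharDemo, decodeAsUtf8) are made up for illustration, and the sample string is only a fragment of the title. It relies on the documented behaviour of new String(byte[], Charset), which replaces malformed input with the charset's replacement string.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: decode raw bytes as UTF-8.
    // Malformed sequences are replaced with U+FFFD by this constructor.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In ISO-8859-1, 'é' (U+00E9) is the single byte 0xE9,
        // which is not a valid UTF-8 sequence on its own.
        byte[] latin1Bytes = "Caf\u00e9: Rock".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1Bytes);
        String right = new String(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println((int) wrong.charAt(3)); // code point of the substituted character
        System.out.println((int) right.charAt(3)); // code point of the original character
    }
}
```

With the wrong decoder the fourth character comes out as 65533 (U+FFFD); with the matching decoder it is 233 (U+00E9).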

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
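
As a rough illustration of why the escaped output contains 65533, the sketch below uses a hand-written stand-in for an HTML escaper. escapeNonAscii is a hypothetical helper, not the Commons Lang implementation: it only escapes '&' and writes every non-ASCII code point as a decimal character reference, without named entities such as &eacute;.

```java
public class NumericEscapeDemo {
    // Hypothetical helper, NOT StringEscapeUtils.escapeHtml:
    // escapes '&' and emits non-ASCII code points as decimal references.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string as it looks after a bad decode: 'é' is already U+FFFD.
        String alreadyDamaged = "Das hermetische Caf\uFFFD: Rock & Wrestling 2010";
        System.out.println(escapeNonAscii(alreadyDamaged));
    }
}
```

The escaper faithfully encodes whatever character it is given, so the damage has to be fixed earlier, at decoding time.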

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
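
One way to do that, sketched under assumptions: wrap the response stream in an InputStreamReader with an explicit charset instead of relying on a default. The readAll helper and the class name are hypothetical, and a ByteArrayInputStream stands in for the HTTP response body, since how the page is fetched is not shown in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeWithCharsetDemo {
    // Hypothetical helper: read the whole stream using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the fetched page body, encoded as ISO-8859-1.
        byte[] body = "<title>Das hermetische Caf\u00e9</title>"
                .getBytes(StandardCharsets.ISO_8859_1);

        String html = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);
        System.out.println(html.contains("\u00e9")); // the 'é' survives decoding
    }
}
```

In real code the charset would ideally come from the response's content-type header rather than being hard-coded.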
