UTF-8 decoding in Java

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/1642292/

Date: 2020-08-12 18:35:35  Source: igfitidea

utf-8 decoding in java

Tags: java, encoding, utf-8, groovy

Asked by user162346

I'm trying to pass parameters from a PHP middle tier to a Java backend that understands J2EE. I'm writing the controller code in Groovy. In there, I'm trying to decode a parameter that will likely contain international characters.

I am really puzzled by the results of my debugging this problem so far, hence I wanted to share it with you in the hope that someone will be able to give the correct interpretation of my results.

For the sake of my little test, the parameter I'm passing is "déjeuner". Just to be sure, System.out.println("déjeuner") correctly gives me:

déjeuner

in the console

Now following are the char/dec and hex values of each char of the original string:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note that the c3a9 sequence in UTF-8 is the wished-for character: http://www.fileformat.info/info/unicode/char/00e9/index.htm

Now if I try to read this string as a UTF-8 string, as in stmt.getBytes("UTF-8"), I suddenly end up with an 11-byte sequence, as follows:

64 c3 83 c2 a9 6a 65 75 6e 65 72

whereas stmt.getBytes("iso-8859-1") gives me 9 bytes:

64 c3 a9 6a 65 75 6e 65 72

note the c3a9 sequence here!

now if I try to convert the UTF-8 sequence to UTF-8, as in

new String(stmt.getBytes("UTF-8"), "UTF-8");

I get:

next char: d 100 64
next char: ? -61 c3
next char: ? -87 a9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note the c3a9 sequence

while

new String(stmt.getBytes("iso-8859-1"), "UTF-8")

results in:

next char: d 100 64
next char: ? -23 e9
next char: j 106 6a
next char: e 101 65
next char: u 117 75
next char: n 110 6e
next char: e 101 65
next char: r 114 72

note the e9 which in utf-8 (and ascii) is, again, the 'é' character that I'm longing for.

Unfortunately, in neither case am I ending up with a proper string that would display like the literal string "déjeuner". Strangely enough, the byte sequences both seem correct though.

Accepted answer by Aaron Digulla

When dealing with Strings, always remember: byte != char. So in your first example, you have the char c3, not the byte c3, which is a huge difference: the byte would be part of the UTF-8 sequence, but the char already is Unicode. So when you convert that to UTF-8, the Unicode character c3 must become the byte sequence c3 83.
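The difference between the char c3 and the byte c3 can be reproduced with a small sketch. It assumes the question's stmt holds the chars U+00C3 and U+00A9 (which is what the char dump suggests); the class name and the hex helper are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class CharVersusByteDemo {
    // Formats bytes as space-separated lowercase hex, e.g. "64 c3 83".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the already-corrupted string from the question,
        // i.e. 9 chars, two of which are U+00C3 and U+00A9.
        String stmt = "d\u00C3\u00A9jeuner";

        // Each of the two chars above 0x7f needs two bytes in UTF-8: 11 bytes.
        System.out.println(hex(stmt.getBytes(StandardCharsets.UTF_8)));
        // prints: 64 c3 83 c2 a9 6a 65 75 6e 65 72

        // ISO-8859-1 maps every char below 256 to the byte of the same value: 9 bytes.
        System.out.println(hex(stmt.getBytes(StandardCharsets.ISO_8859_1)));
        // prints: 64 c3 a9 6a 65 75 6e 65 72
    }
}
```

The two outputs match the 11-byte and 9-byte sequences reported in the question.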

So the question is: How did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.

The reason why ISO-8859-1 usually works is that this encoding doesn't modify any char with a code point < 256 (i.e. anything between 0 and 255), so UTF-8 encoded byte sequences won't be modified.

Your last example is also wrong: the char e9 is é in ISO-8859-1 and Unicode. In UTF-8, it's not valid, since it's not a byte and since the byte c3 prefix is missing. That said, it correctly represents the Unicode string you seek.
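As a rough sketch of why the ISO-8859-1 round trip recovers the intended text: encoding the mis-decoded chars as ISO-8859-1 gives back the original bytes unchanged, and those bytes can then be decoded as UTF-8. The class and method names are hypothetical, and the input is assumed to be the question's corrupted string. This is a workaround for already-damaged data, not a substitute for fixing the code that produced the String.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {
    // Recovers text whose UTF-8 bytes were wrongly decoded as ISO-8859-1.
    // Only safe when every char is below 256; other chars would be encoded as '?'.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the corrupted 9-char string from the question.
        String stmt = "d\u00C3\u00A9jeuner";
        String repaired = repair(stmt);

        System.out.println(repaired.length());                 // prints: 8
        System.out.println(repaired.equals("d\u00E9jeuner"));  // prints: true
        System.out.println(Integer.toHexString(repaired.charAt(1))); // prints: e9
    }
}
```

The single char e9 in the result is the Unicode code point of é, which is why the dump in the question shows e9 rather than the c3 a9 byte pair.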

Answer by McDowell

If you start with the Java String where "d\u00C3\u00A9jeuner".equals(stmt) then the data is already corrupt at this stage.

A Java char is not a C char. A char in Java is 16 bits wide and implicitly contains UTF-16 encoded data. Trying to store any other encoded data in a Java char/String type is asking for trouble. Character data in any other encoding should be kept as byte data.

If you are reading the parameter using the servlet API, then it is likely that the HTTP request contains inconsistent or insufficient encoding information. Check the calling code and the HTTP headers. It is likely that the client is encoding the data as UTF-8, but the servlet is decoding it as ISO-8859-1.
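The suspected mismatch can be simulated outside a servlet container with java.net.URLDecoder. This is only a sketch: it assumes the client sent the parameter percent-encoded as UTF-8 (d%C3%A9jeuner), and the class and method names are made up for illustration. A real container's decoding path may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodingDemo {
    // Decodes a percent-encoded value using the given charset name.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: how a UTF-8 client would percent-encode the parameter.
        String raw = "d%C3%A9jeuner";

        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");

        // Correct decoding: the two bytes c3 a9 become the single char U+00E9.
        System.out.println(asUtf8.equals("d\u00E9jeuner"));          // prints: true

        // Mismatched decoding: each byte becomes its own char, which
        // reproduces the corrupted 9-char string seen in the question.
        System.out.println(asLatin1.equals("d\u00C3\u00A9jeuner"));  // prints: true
    }
}
```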

Answer by Martin

I'm having a very similar problem, except that my form uses a GET request, not a POST request.

So, my URL is something like: http://localhost:4502/form.jsp?query=d%C3%A9jeuner

request.getCharacterEncoding() = ISO-8859-1
response.getCharacterEncoding() = UTF-8
request.getParameter("query") = d??jeuner

So should the HttpServletRequest use UTF-8 to decode the request param (which it clearly doesn't), or is this simply a browser error because the browser does not set any character encoding header (which again doesn't make much sense, because it's not doing a POST request)? Here is the full set of headers; notice the %C3%A9 in the URL.

http://localhost:4502/form.jsp?query=d%C3%A9juerne

GET /form.jsp?query=d%C3%A9juerne HTTP/1.1
Host: localhost:4502
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.17) Gecko/2010010604 Ubuntu/9.04 (jaunty) Firefox/3.0.17
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

The problem I'm having is that I actually copied and pasted the query into the browser form and it was incorrectly encoded, both in Chrome and Firefox.

Answer by Martin

After some further investigation I found this answer:

How to get UTF-8 working in Java webapps?

It's all about setting URIEncoding="UTF-8" in the Tomcat connector.
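For reference, a connector entry with that attribute might look roughly like the fragment below. This is a configuration sketch, not taken from the original answer: the port and protocol values are placeholders, and the file location assumes a standard Tomcat layout (conf/server.xml).

```xml
<!-- conf/server.xml (fragment); port and protocol are placeholder values -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />
```

URIEncoding controls how percent-encoded bytes in the request URI, including the query string of a GET request, are turned into characters.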

Now to figure out how to do this in the CMS we use (CQ5/Day).
