Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/6838446/

Time: 2020-10-30 17:31:00  Source: igfitidea

UTF Encoding for Chinese Characters - Java

java encoding utf

Asked by Maurice

I am receiving a String via an object from an Axis web service. Because I'm not getting the string I expected, I checked by converting the string into bytes. I get C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297 in hex, when I'm expecting E4BDA0 E5A5BD E59097, which is 你好吗 in UTF-8.

Any ideas what might be causing 你好吗 to become C3A4C2 BDC2A0 C3A5C2 A5C2BD C3A5C2 90C297? I did a Google search, but all I found was a Chinese website describing a problem that happens in Python. Any insights would be great, thanks!

Answered by Ray Toal

You have what is known as a double encoding.

You have the three-character sequence "你好吗", which, as you correctly point out, is encoded in UTF-8 as E4BDA0 E5A5BD E59097.

But now, start encoding each byte of THAT encoding in UTF-8. Start with E4. Treat it as the code point U+00E4: what is its encoding in UTF-8? Try it! It's C3 A4!
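
The single-byte step above can be checked directly. The following is a minimal sketch, assuming the bytes were misread as ISO-8859-1 (which matches the C2 90 and C2 97 pairs in the question); the class name SingleByteDemo and the helper hex are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The byte 0xE4, misread as ISO-8859-1, becomes the character U+00E4.
        String misread = new String(new byte[] {(byte) 0xE4}, StandardCharsets.ISO_8859_1);
        // Encoding U+00E4 as UTF-8 takes two bytes.
        System.out.println(hex(misread.getBytes(StandardCharsets.UTF_8))); // prints "C3 A4"
    }
}
```

Repeating this for BD and A0 gives C2 BD and C2 A0, which are the next bytes in the question's dump.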

You get the idea.... :-)

Here is a Java app which illustrates this:

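import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        // Correct UTF-8 encoding of the three characters (9 bytes).
        byte[] encoding1 = "你好吗".getBytes(StandardCharsets.UTF_8);
        // Misread those bytes as ISO-8859-1: each byte becomes one char.
        String string1 = new String(encoding1, StandardCharsets.ISO_8859_1);
        for (byte b : encoding1) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
        // Encode the misread string as UTF-8 again: this is the double encoding.
        byte[] encoding2 = string1.getBytes(StandardCharsets.UTF_8);
        for (byte b : encoding2) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}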
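
If the garbled string is all you have, the same two steps can be run in reverse. This is only a sketch, assuming exactly one round of double encoding through ISO-8859-1; the names RepairDoubleEncoding and repair are hypothetical and not from the original answer. Fixing the place where the bytes are decoded with the wrong charset is likely the better long-term fix.

```java
import java.nio.charset.StandardCharsets;

public class RepairDoubleEncoding {
    // Hypothetical helper: undoes one round of "UTF-8 bytes misread as ISO-8859-1".
    // Assumes every char in the input is at most U+00FF; any other char would
    // be replaced by '?' when converting back to ISO-8859-1 bytes.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expected = "\u4F60\u597D\u5417"; // 你好吗, written as escapes to avoid source-encoding issues
        // Simulate the problem: UTF-8 bytes decoded as ISO-8859-1.
        String garbled = new String(expected.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled).equals(expected)); // prints "true"
    }
}
```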