Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4782662/

Time: 2020-10-30 08:00:21  Source: igfitidea

Java charset encoding problem (from UTF-8 to cp866)

Tags: java, character-encoding

Asked by NikolayGS

How can I convert text from UTF-8/cp1251 (Windows Cyrillic) to DOS Cyrillic (cp866)?

I found this example:

Charset fromCharset = Charset.forName("utf8");
Charset toCharset = Charset.forName("cp866");

String text1 = "Николай"; // my name in Bulgarian
String text2 = "Nikolay"; // my name in English

System.out.println("TEXT1 :[" + toCharset.decode(fromCharset.encode(text1)).toString() + "]");
System.out.println("TEXT2 :[" + toCharset.decode(fromCharset.encode(text2)).toString() + "]");

And the output is:

TEXT1 :[╨Э╨╕╨║╨╛╨╗╨?╨╣] // WRONG
TEXT2 :[Nikolay]  // CORRECT

Where is the problem?

Answered by Joachim Sauer

First off: if you've got a String object, then it no longer has an encoding; it's a pure Unicode string (*)!

In Java, encodings are used only when you convert from bytes (byte[]) to a string (String) or vice versa. (You could theoretically do a direct conversion from byte[] to byte[], but I've yet to see that done in Java.)

If you have some cp1251-encoded data, then it must be either a byte[] (i.e. an array of bytes) or in some kind of stream (e.g. provided to you as an InputStream).

If you want to provide some data as cp866, then you must provide it either as a byte[] or as some kind of stream (e.g. an OutputStream).
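The byte[]-in, byte[]-out route described above can be sketched as follows. This is only an illustration under some assumptions: the class and method names (TranscodeBytes, transcode) are made up for the example, "windows-1251" is used as the Java charset name for cp1251, and the JVM is assumed to ship both charsets (a full JDK normally does).

```java
import java.nio.charset.Charset;

public class TranscodeBytes {
    // Hypothetical helper: decode the bytes with the charset they were
    // written in, then re-encode the resulting String with the target charset.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> Unicode String
        return text.getBytes(to);              // Unicode String -> bytes
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        Charset cp866 = Charset.forName("cp866");

        // The name from the question, written out as cp1251 bytes
        byte[] source = {(byte) 0xCD, (byte) 0xE8, (byte) 0xEA, (byte) 0xEE,
                         (byte) 0xEB, (byte) 0xE0, (byte) 0xE9};
        byte[] target = transcode(source, cp1251, cp866);

        // Print the converted bytes as hex so the result does not depend
        // on the console's own encoding
        StringBuilder hex = new StringBuilder();
        for (byte b : target) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // 8D A8 AA AE AB A0 A9
    }
}
```

The String in the middle is the only place where the text has no external encoding; everything before and after it is bytes in a named charset.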

Also: there's no such thing as "utf8/cp1251". UTF-8 and CP-1251 are pretty much unrelated character encodings. Your input is either UTF-8 or CP-1251 (or something else). It can't really be both (+).

And here's the obligatory link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

(*) Yes, strictly speaking it has an encoding and it is UTF-16, but for most purposes you can (and should) think of it as an "encodingless ideal Unicode String".
(+) Strictly speaking, it could be both if it only uses characters that encode to the same bytes in both encodings, which is usually the ASCII subset.
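The (+) footnote can be checked directly. The sketch below compares the bytes a text produces in two charsets; the class and method names are invented for the illustration, and "windows-1251" is assumed to be available as the Java name for cp1251. The Cyrillic text is written with \u escapes so that the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiOverlap {
    // Hypothetical helper: true if the text encodes to identical bytes
    // in both charsets.
    public static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");

        // Pure ASCII: identical bytes in UTF-8 and cp1251
        System.out.println(sameBytes("Nikolay", StandardCharsets.UTF_8, cp1251)); // true

        // The same name in Cyrillic: two bytes per letter in UTF-8,
        // one byte per letter in cp1251
        String cyrillic = "\u041D\u0438\u043A\u043E\u043B\u0430\u0439";
        System.out.println(sameBytes(cyrillic, StandardCharsets.UTF_8, cp1251)); // false
    }
}
```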

Answered by Jon Skeet

The problem is that you're trying to decode the output of one encoding as if it's a different one.

Imagine that you had a program which could only write out JPEGs, and another which could only read PNGs... would you expect to be able to read the output of the first program with the second?

In this case the two encodings happen to be compatible for ASCII characters, but fundamentally you're doing the wrong thing.

If you have text which is already in UTF-8, you should read that from binary data into a Unicode string using the UTF-8 encoding, and then write it out to binary data again using your other encoding. Unicode is basically the intermediate step, as it is Java's native text format. This would be the equivalent of loading the JPEG output into another program which could perform the conversion to PNG before you read it with the second app.

Answered by basil

A short solution to your problem:

 System.out.write("ВАСЯ\n".getBytes("cp866")); // right: writes the cp866-encoded bytes
 System.out.println("ВАСЯ".getBytes("cp866")); // wrong: prints the array reference, e.g. [B@1bab50a

Result from cmd.exe:

cmd.exe 的结果:

C:\Documents and Settings\afram\Мои документы\NetBeansProjects\Encoding\dist>java -jar Encoding.jar
ВАСЯ
[B@1bab50a
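An alternative to calling getBytes() by hand is to wrap the output in a PrintStream that does the encoding itself. The following is a minimal sketch, not part of the original answer: the class and method names are made up, and it relies on the PrintStream(OutputStream, boolean, String) constructor with "cp866" as the charset name. Passing System.out as the target would print to the console; the demo writes to a ByteArrayOutputStream so the bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Cp866Print {
    // Hypothetical helper: prints text to the target stream, encoded
    // with the named charset instead of the platform default.
    public static void printAs(OutputStream target, String text, String charsetName) {
        try {
            PrintStream out = new PrintStream(target, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        // The word from the answer above, written with \u escapes
        printAs(captured, "\u0412\u0410\u0421\u042F", "cp866");

        StringBuilder hex = new StringBuilder();
        for (byte b : captured.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // 82 80 91 9F
    }
}
```

Whether the text then displays correctly still depends on the console actually using cp866 as its code page.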

Answered by josefx

Short:

短的:

You decode UTF-8 encoded bytes as cp866. Since UTF-8 and cp866 share only the ASCII symbols, everything else gets mangled.

Long:

长:

Java represents Strings using UTF-16 internally; all String objects are encoded in UTF-16.

Charset.encode() creates a ByteBuffer containing the String in the chosen encoding. In your code, this converts the Java UTF-16 String into a UTF-8 encoded byte buffer.

Charset.decode() takes a ByteBuffer encoded in that Charset and converts it into a Java UTF-16 String. In your case you decode a UTF-8 string with a cp866 decoder, resulting in a mangled String.
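The effect of pairing encode() and decode() with different charsets can be reproduced on a single letter. The sketch below is only an illustration (the class and method names are invented); it uses \u escapes so the source file encoding plays no part. The letter Н (U+041D) becomes the two bytes D0 9D in UTF-8, and cp866 reads those two bytes as ╨ (U+2568) and Э (U+042D), which is exactly how the question's output starts.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper mirroring the question's code:
    // encode with one charset, decode with another.
    public static String encodeThenDecode(String text, Charset encodeWith, Charset decodeWith) {
        ByteBuffer bytes = encodeWith.encode(text);
        return decodeWith.decode(bytes).toString();
    }

    public static void main(String[] args) {
        Charset cp866 = Charset.forName("cp866");
        String letter = "\u041D"; // Cyrillic capital EN

        // Mismatched charsets: one letter turns into two wrong characters
        String mangled = encodeThenDecode(letter, StandardCharsets.UTF_8, cp866);
        System.out.println(mangled.equals("\u2568\u042D")); // true

        // Matching charsets: the round trip returns the original text
        String intact = encodeThenDecode(letter, cp866, cp866);
        System.out.println(intact.equals(letter)); // true

        // ASCII survives even the mismatch, which is why TEXT2 looked fine
        System.out.println(encodeThenDecode("Nikolay", StandardCharsets.UTF_8, cp866)); // Nikolay
    }
}
```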

Since Java Strings have a fixed internal encoding, you have to specify the external encoding when you read or write them. Both InputStreamReader and OutputStreamWriter provide constructors with a Charset argument.

Here is an example of how you can convert files/streams.

// Input: the source is encoded in fromCharset
BufferedReader in = new BufferedReader(new InputStreamReader(..., fromCharset));
// Output: the target will be encoded in toCharset
PrintWriter out = new PrintWriter(new OutputStreamWriter(..., toCharset));
// readLine() returns a decoded String, or null at the end of the input
String line = in.readLine();
while (line != null) {
    out.println(line); // re-encoded in toCharset on the way out
    line = in.readLine();
}
out.flush(); // PrintWriter buffers, so flush (or close) it when done
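A self-contained variant of the same idea, using in-memory streams in place of the elided sources above, might look like this. It is a sketch under assumptions: the class and method names are hypothetical, and it copies characters through a buffer instead of going line by line, so line endings pass through unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamTranscoder {
    // Hypothetical helper: reads the source as text in one charset and
    // writes the same text to the target in another charset.
    public static void convert(InputStream source, Charset from,
                               OutputStream target, Charset to) {
        try {
            Reader in = new InputStreamReader(source, from);
            Writer out = new OutputStreamWriter(target, to);
            char[] buffer = new char[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            out.flush(); // push any buffered characters through the encoder
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Seven Cyrillic letters plus a newline, as UTF-8 bytes
        byte[] utf8 = "\u041D\u0438\u043A\u043E\u043B\u0430\u0439\n"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8,
                result, Charset.forName("cp866"));

        System.out.println(utf8.length);   // 15
        System.out.println(result.size()); // 8
    }
}
```

For real files, the two in-memory streams would be replaced by a FileInputStream and a FileOutputStream, closed when the copy is finished.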

Answered by Danubian Sailor

The problem is that your console output isn't cp866. The console is one thing; converting is another.

Internally, a String in Java is always Unicode; the charset matters for input/output operations. You haven't specified what you want to do with the 'converted' string, but you should definitely look at the classes InputStreamReader and OutputStreamWriter. They let you set the charset for your I/O operations.