Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/2622911/

Time: 2020-08-13 10:00:35  Source: igfitidea

"Fix" String encoding in Java

java, encoding

Asked by Nico

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).

Is there a way to convert this String back to the right encoding?

I know it's easy to do if you have access to the original byte array, but in my case it's too late because the String is handed to me by a closed-source library.

Accepted answer by Joachim Sauer

As there seems to be some confusion about whether this is possible or not, I think I'll need to provide an extensive example.

The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").

For this example I'll choose the German word "Bär" (meaning bear) as the input:

byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.

(If your JVM doesn't support that encoding, you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in both encodings.)
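
A minimal sketch of that fallback, assuming `Charset.isSupported` is an acceptable way to probe the JVM. The class name `CharsetFallback` and the method `westernCharset` are made up for illustration and are not part of the original answer:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Returns windows-1252 when this JVM provides it, otherwise ISO-8859-1.
    // (Hypothetical helper; the two charsets agree on the letters used here.)
    static Charset westernCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
        String correctString = new String(ib, westernCharset());
        // Byte 0xE4 decodes to U+00E4 in both charsets.
        System.out.println(correctString.charAt(1) == '\u00E4');
    }
}
```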

The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is, for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):

String is = new String(ib, "UTF-8");
System.out.println(is);

This obviously produces the incorrect output "B�".

The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.

Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:

byte[] utf8Again = is.getBytes("UTF-8");

But that returns the UTF-8 encoding of the two characters B and � (the Unicode replacement character) and definitely returns the wrong result when re-interpreted as Windows-1252:

System.out.println(new String(utf8Again, "Windows-1252"));

This line produces the output "Bï¿½", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).

So in this case you can't undo the operation, because information is lost.
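
The loss can be checked directly. The following sketch (the class name `LossyRoundTrip` and its helper are illustrative, not from the original answer) sends both "Bär" and "Bür" through the mistaken UTF-8 decoding and compares the resulting bytes. The exact replacement output can differ between Java versions, so the code compares byte arrays instead of relying on a particular printed string:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Decodes the bytes as UTF-8 (the mistake) and encodes the result as UTF-8 again.
    static byte[] roundTripThroughUtf8(byte[] original) {
        String misdecoded = new String(original, StandardCharsets.UTF_8);
        return misdecoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] baer = { 0x42, (byte) 0xE4, 0x72 }; // "Bär" in Windows-1252
        byte[] buer = { 0x42, (byte) 0xFC, 0x72 }; // "Bür" in Windows-1252

        byte[] baerAgain = roundTripThroughUtf8(baer);
        byte[] buerAgain = roundTripThroughUtf8(buer);

        // The original bytes do not come back ...
        System.out.println(Arrays.equals(baer, baerAgain));
        // ... and two different inputs collapse to the same bytes.
        System.out.println(Arrays.equals(baerAgain, buerAgain));
    }
}
```

Because two different inputs end up as identical bytes, no function of the garbled String alone can tell them apart.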

There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
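
As a sketch of a case that can be undone, assume the bytes were wrongly decoded as ISO-8859-1 instead of UTF-8. Every byte value maps to exactly one character in ISO-8859-1, so encoding the String with the same charset gives back the original bytes. The class name `ReversibleMisdecoding` and the method `recoverBytes` are hypothetical names chosen for this example:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReversibleMisdecoding {
    // Undoes a wrong ISO-8859-1 decoding by recovering the original bytes.
    static byte[] recoverBytes(String misdecoded) {
        return misdecoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Bär" as UTF-8 bytes: 42 C3 A4 72
        byte[] original = "B\u00E4r".getBytes(StandardCharsets.UTF_8);

        // The mistake: decoding UTF-8 data as ISO-8859-1.
        String misdecoded = new String(original, StandardCharsets.ISO_8859_1);

        // The repair: get the bytes back, then decode them with the right charset.
        byte[] recovered = recoverBytes(misdecoded);
        String repaired = new String(recovered, StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(original, recovered));
        System.out.println(repaired.equals("B\u00E4r"));
    }
}
```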

Answer by LB40

You can use this tutorial

The charset you need should be defined in rt.jar (according to this)

Answer by kgiannakakis

What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.

Edit: As a commenter said, this won't work. When you convert a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding exceptions. (See here and here.)
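
Note that `new String(bytes, charset)` does not itself throw on malformed input; it silently substitutes a replacement character. To see the failure as an exception, one option is a `CharsetDecoder` configured with `CodingErrorAction.REPORT`. The following is only a sketch, and the class name `StrictUtf8` and the method `isValidUtf8` are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Returns true only if the bytes form valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed or unmappable input when REPORT is set.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] cp1252Bytes = { 0x42, (byte) 0xE4, 0x72 };           // "Bär" in Windows-1252
        byte[] utf8Bytes = "B\u00E4r".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(cp1252Bytes));
        System.out.println(isValidUtf8(utf8Bytes));
    }
}
```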

Answer by les2

I tried this and it worked for some reason

Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):

 final Charset fromCharset = Charset.forName("windows-1252");
 final Charset toCharset = Charset.forName("UTF-8");
 String fixed = new String(input.getBytes(fromCharset), toCharset);
 System.out.println(input);
 System.out.println(fixed);

The results are:

 input: â€¦Und ich beweg mich (aber heut nur langsam)
 fixed: …Und ich beweg mich (aber heut nur langsam)

Here's another example:

 input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
 fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)

Here's what is happening and why the trick above seems to work:

  1. The original file was a UTF-8 encoded text file (comma delimited)
  2. That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
  3. The user thought the import was successful because all of the characters in the ASCII range looked okay.
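
The faulty import described in the steps above can be simulated in a few lines. This is a sketch under the assumption that the JVM provides windows-1252; the class name `MojibakeDemo` and the method `importAsCp1252` are illustrative only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumes the JVM ships windows-1252; Charset.forName throws otherwise.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the faulty import: UTF-8 bytes read as Windows-1252.
    static String importAsCp1252(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        // "ü" (U+00FC) is c3 bc in UTF-8, which Windows-1252 shows as two characters.
        String garbled = importAsCp1252("K\u00FChn");
        System.out.println(garbled.equals("K\u00C3\u00BChn"));

        // Pure ASCII text survives unchanged, which is why the import looked fine.
        System.out.println(importAsCp1252("Wolfgang").equals("Wolfgang"));
    }
}
```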

Now, when we try to "reverse" the process, here is what happens:

 // requires: import java.nio.charset.Charset;
 // we start with this garbage, two characters we don't want!
 String input = "Ã¼";

 final Charset cp1252 = Charset.forName("windows-1252");
 final Charset utf8 = Charset.forName("UTF-8");

 // let's convert it to bytes in windows-1252:
 // this gives you 2 bytes: c3 bc
 // "Ã" ==> c3
 // "¼" ==> bc
 byte[] windows1252Bytes = input.getBytes(cp1252);

 // but in utf-8, c3 bc is "ü"
 String fixed = new String(windows1252Bytes, utf8);

 System.out.println(input);
 System.out.println(fixed);

The encoding-fixing code above mostly works, but it fails for the following characters:

(Assuming the only characters used are 1-byte characters from Windows-1252):

char    utf-8 bytes     |   string decoded as cp1252 -->   as cp1252 bytes
”       e2 80 9d        |       â€�                        e2 80 3f
Á       c3 81           |       Ã�                         c3 3f
Í       c3 8d           |       Ã�                         c3 3f
Ï       c3 8f           |       Ã�                         c3 3f
Ð       c3 90           |       Ã�                         c3 3f
Ý       c3 9d           |       Ã�                         c3 3f

It does work for some of the characters, e.g. these:

char    utf-8 bytes     |   string decoded as cp1252 -->   as cp1252 bytes    fixed
Þ       c3 9e           |       Ãž                         c3 9e              Þ
ß       c3 9f           |       ÃŸ                         c3 9f              ß
à       c3 a0           |       Ã                          c3 a0              à
á       c3 a1           |       Ã¡                         c3 a1              á
â       c3 a2           |       Ã¢                         c3 a2              â
ã       c3 a3           |       Ã£                         c3 a3              ã
ä       c3 a4           |       Ã¤                         c3 a4              ä
å       c3 a5           |       Ã¥                         c3 a5              å
æ       c3 a6           |       Ã¦                         c3 a6              æ
ç       c3 a7           |       Ã§                         c3 a7              ç
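
One way to avoid silently producing a wrong "fix" for the failing characters is to make both conversion steps strict and give up when either step would lose data. The following is a sketch rather than a complete solution; the class name `MojibakeRepair` and the method `repair` are hypothetical, and it assumes the JVM provides windows-1252:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class MojibakeRepair {
    // Assumes the JVM ships windows-1252; Charset.forName throws otherwise.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Returns the repaired text, or empty if the round trip would lose data.
    static Optional<String> repair(String garbled) {
        try {
            // Step 1: recover the bytes; fail if a character has no cp1252 byte.
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            // Step 2: decode as UTF-8; fail if the bytes are not valid UTF-8.
            CharBuffer fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes);
            return Optional.of(fixed.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Repairable: c3 bc is valid UTF-8 for U+00FC.
        System.out.println(repair("K\u00C3\u00BChn").isPresent());
        // Not repairable: the second byte was already lost to U+FFFD.
        System.out.println(repair("\u00C3\uFFFD").isPresent());
    }
}
```

Text that is pure ASCII passes through unchanged, so an empty result only signals that the input was not a clean UTF-8-read-as-cp1252 mix-up.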

NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.