"Fix" String encoding in Java
Original question: http://stackoverflow.com/questions/2622911/
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator): Stack Overflow
Asked by Nico
I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but in my case it's too late because the String is given by a closed source library.
Accepted answer by Joachim Sauer
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is, for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B and � (U+FFFD, the replacement character) and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252"));
This line produces the output "Bï¿½", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because information is lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
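The difference between a reversible and a lossy mis-decoding can be checked directly. The following is a minimal sketch, not part of the original answer: the class name RoundTripCheck and the helper survivesRoundTrip are made up for illustration. It decodes the example bytes with a "wrong" charset, encodes the resulting String again, and compares the result with the original array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    // Hypothetical helper: returns true if decoding the bytes with the given
    // charset and encoding the resulting String again yields the same bytes.
    public static boolean survivesRoundTrip(byte[] original, Charset wrongCharset) {
        String decoded = new String(original, wrongCharset);
        return Arrays.equals(original, decoded.getBytes(wrongCharset));
    }

    public static void main(String[] args) {
        // "Bär" encoded as Windows-1252, as in the example above
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };

        // Every byte value is a valid ISO-8859-1 character, so nothing is lost.
        System.out.println(survivesRoundTrip(ib, StandardCharsets.ISO_8859_1)); // true

        // 0xE4 followed by 0x72 is not valid UTF-8, so the decoder substitutes
        // U+FFFD and the original byte cannot be recovered.
        System.out.println(survivesRoundTrip(ib, StandardCharsets.UTF_8)); // false
    }
}
```

Exactly which characters end up in the mis-decoded String can depend on the JVM version, but the round trip fails whenever a replacement character was substituted.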
Answer by LB40
Answer by kgiannakakis
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit: As a commenter said, this won't work. When you convert a Windows-1252 byte array as if it was UTF-8, you are bound to get encoding exceptions. (See here and here).
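Whether an exception is actually thrown depends on how the bytes are decoded. The String constructors that take a charset replace malformed input with U+FFFD silently, while a CharsetDecoder configured with CodingErrorAction.REPORT throws. A minimal sketch of the strict variant follows; the class name StrictUtf8 and the method decodeStrict are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Hypothetical helper: decodes bytes as UTF-8 and throws on invalid
    // input instead of substituting the replacement character.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // "Bär" encoded as Windows-1252: not a valid UTF-8 sequence
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
        try {
            System.out.println(decodeStrict(ib));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

If the closed source library used the lenient String constructor, no exception would have been raised and the damage only shows up as replacement characters in the result.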
Answer by les2
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: â€¦Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
- The original file was a UTF-8 encoded text file (comma delimited)
- That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
- The user thought the import was successful because all of the characters in the ASCII range looked okay.
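The mis-import described in this list can be reproduced in a few lines. This is a sketch under the assumption that the import amounts to decoding UTF-8 bytes as windows-1252; the class name MojibakeDemo and the method misread are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: simulates reading UTF-8 encoded text with the
    // wrong charset (windows-1252), as in the Excel import described above.
    public static String misread(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "Kühn", written with an escape so the source file encoding does not matter
        String garbled = misread("K\u00FChn");
        System.out.println(garbled);          // the ü (bytes c3 bc) shows up as two characters
        System.out.println(garbled.length()); // 5, one more than the original
        System.out.println(misread("Wolfgang")); // pure ASCII text is unaffected
    }
}
```

Because ASCII characters have the same single-byte encoding in both charsets, a file without accented characters looks correct after the import, which is why the mistake is easy to miss.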
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "Ã¼";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// let's convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
byte[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used are 1-byte characters from Windows-1252):
char  utf-8 bytes  |  string decoded as cp1252  -->  as cp1252 bytes
”     e2 80 9d     |  â€�                             e2 80 3f
Á     c3 81        |  Ã�                              c3 3f
Í     c3 8d        |  Ã�                              c3 3f
Ï     c3 8f        |  Ã�                              c3 3f
Ð     c3 90        |  Ã�                              c3 3f
Ý     c3 9d        |  Ã�                              c3 3f
It does work for some of the characters, e.g. these:
Þ     c3 9e        |  Ãž                              c3 9e   Þ
ß     c3 9f        |  ÃŸ                              c3 9f   ß
à     c3 a0        |  Ã(NBSP)                         c3 a0   à
á     c3 a1        |  Ã¡                              c3 a1   á
â     c3 a2        |  Ã¢                              c3 a2   â
ã     c3 a3        |  Ã£                              c3 a3   ã
ä     c3 a4        |  Ã¤                              c3 a4   ä
å     c3 a5        |  Ã¥                              c3 a5   å
æ     c3 a6        |  Ã¦                              c3 a6   æ
ç     c3 a7        |  Ã§                              c3 a7   ç
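Since the repair silently produces wrong output for the failing characters, it may help to detect those cases instead of printing a damaged result. The sketch below is one possible approach, not the code from this answer: the class MojibakeRepair and the method tryRepair are hypothetical names. It performs the same windows-1252 to UTF-8 conversion, but with encoder and decoder set to report errors, and returns an empty Optional when the input cannot be converted without loss.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class MojibakeRepair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: tries to undo "UTF-8 bytes decoded as windows-1252".
    // Returns Optional.empty() if the string contains a character that has no
    // windows-1252 byte (such as U+FFFD) or if the bytes are not valid UTF-8.
    public static Optional<String> tryRepair(String garbled) {
        try {
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            CharBuffer fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes);
            return Optional.of(fixed.toString());
        } catch (CharacterCodingException e) {
            // Information was lost earlier; a faithful repair is not possible.
            return Optional.empty();
        }
    }
}
```

With this sketch, tryRepair("KÃ¼hn") yields "Kühn", while a string in which the second byte was already replaced by U+FFFD yields an empty Optional, so the caller can fall back to the unrepaired text or flag the record for manual review.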
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.