Original question: http://stackoverflow.com/questions/37155417/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
String encoding conversion UTF-8 to SHIFT-JIS
Asked by David B
Variables used:
- JavaSE-6
- No frameworks
Given the string input ピーター・ジョーズ, which is encoded in UTF-8, I am having problems converting it to Shift-JIS without writing the data to a file.
- Input (UTF-8 encoding): ピーター・ジョーンズ
- Output (SHIFT-JIS encoding): ピーター・ジョーンズ (to be encoded in SHIFT-JIS)
I've tried these code snippets to convert UTF-8 strings to SHIFT-JIS:
stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")
Both code snippets return this string output: ?s?[?^?[?E?W???[???Y
(SHIFT-JIS encoded)
Any ideas on how this can be resolved?
Answered by Christoffer Hammarström
Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.
(Note that "encoding", "charset" and Charset are more or less synonyms.)
A String should be treated as a sequence of Unicode code points (even though in Java it is actually a sequence of UTF-16 code units).
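The difference between code units and code points can be seen in a minimal sketch like the following. The class name CodePointDemo and its helper methods are made up for illustration; only String.length(), String.codePointCount() and Character.toChars() come from the JDK.

```java
public class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+30D4 (katakana PI) fits in a single UTF-16 code unit.
        String bmp = new String(Character.toChars(0x30D4));
        // U+20BB7 lies outside the Basic Multilingual Plane and needs a surrogate pair.
        String supplementary = new String(Character.toChars(0x20BB7));

        System.out.println(codeUnits(bmp) + " / " + codePoints(bmp));                     // prints "1 / 1"
        System.out.println(codeUnits(supplementary) + " / " + codePoints(supplementary)); // prints "2 / 1"
    }
}
```

Neither count says anything about how many bytes the text occupies once it is encoded; that depends entirely on the charset chosen at encoding time.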
If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or a "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.
What you can have is a sequence of bytes that decodes to a String when you decode it using an encoding, such as UTF-8 or Shift-JIS.
Or you can have a String that encodes to a sequence of bytes when you encode it using an encoding, such as UTF-8 or Shift-JIS.
In short, an encoding or Charset is a pair of functions, "encode" and "decode", such that:
// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);
// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();
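As a rough, self-contained illustration of the encode/decode pair, the sketch below pushes a String through a Charset and back. The class name RoundTripDemo and the roundTrip helper are hypothetical, and the Shift_JIS line assumes a JDK that ships the extended charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class RoundTripDemo {
    // Encodes the text with the given charset, then decodes the bytes again.
    static String roundTrip(String text, Charset charset) {
        ByteBuffer encoded = charset.encode(text);
        return charset.decode(encoded).toString();
    }

    public static void main(String[] args) {
        // Katakana PI, long vowel mark, TA, long vowel mark.
        String text = "\u30D4\u30FC\u30BF\u30FC";

        // Charsets that can represent katakana give the original text back.
        for (String name : new String[] {"UTF-8", "UTF-16", "Shift_JIS"}) {
            Charset charset = Charset.forName(name);
            System.out.println(name + ": " + text.equals(roundTrip(text, charset)));
        }

        // US-ASCII cannot represent katakana, so each character is replaced
        // with '?' while encoding and the round trip loses information.
        Charset ascii = Charset.forName("US-ASCII");
        System.out.println("US-ASCII: " + text.equals(roundTrip(text, ascii))); // prints "US-ASCII: false"
    }
}
```

The two functions are only inverses of each other for characters the charset can actually represent, which is why the choice of target encoding matters.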
So if you have a byte[] that's encoded using UTF-8:
byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94 e3 83 bc e3 82 bf (ピ ー タ)
// e3 83 bc e3 83 bb e3 82 b8 (ー ・ ジ)
// e3 83 a7 e3 83 bc e3 82 ba (ョ ー ズ)
You can create a String from those bytes using:
String string = new String(utf8Bytes, "UTF-8");
// string now contains "ピーター・ジョーズ"
Then you can encode that String as Shift-JIS using:
byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73 81 5b 83 5e (ピ ー タ)
// 81 5b 81 45 83 57 (ー ・ ジ)
// 83 87 81 5b 83 59 (ョ ー ズ)
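Putting the two steps together, a conversion that never touches a file might look like the sketch below. The class name Utf8ToShiftJis and the helpers transcode and hex are invented for this example; it uses the canonical charset name Shift_JIS and assumes the JDK in use provides that charset.

```java
import java.nio.charset.Charset;

public class Utf8ToShiftJis {
    // Decodes the input bytes with one charset and re-encodes the text with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // Formats bytes as lowercase hexadecimal separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset shiftJis = Charset.forName("Shift_JIS");

        // First three characters of the example: katakana PI, long vowel mark, TA.
        byte[] utf8Bytes = "\u30D4\u30FC\u30BF".getBytes(utf8);

        System.out.println(hex(utf8Bytes));                            // prints "e3 83 94 e3 83 bc e3 82 bf"
        System.out.println(hex(transcode(utf8Bytes, utf8, shiftJis))); // prints "83 73 81 5b 83 5e"
    }
}
```

The result of the conversion is a byte[], not a String. Inspecting it as hexadecimal avoids any further decoding step that could disguise what the bytes really are.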
Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage:
String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "?s?[?^?[?E?W???[?Y"
// where each ? stands for U+FFFD, the replacement character that the
// decoder substitutes for an invalid UTF-8 sequence
// 83 73 81 5b 83 5e (? s ? [ ? ^)
// 81 5b 81 45 83 57 (? [ ? E ? W)
// 83 87 81 5b 83 59 (? ? ? [ ? Y)
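One way to notice such a mismatch instead of silently getting garbage is to decode strictly. The sketch below is only an illustration: StrictDecodeDemo and decodesCleanly are made-up names, while CharsetDecoder and CodingErrorAction.REPORT are standard java.nio.charset APIs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecodeDemo {
    // Returns true only if the bytes form a valid sequence in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // Shift-JIS bytes for katakana PI, long vowel mark, TA (taken from the listing above).
        byte[] shiftJisBytes = {(byte) 0x83, 0x73, (byte) 0x81, 0x5b, (byte) 0x83, 0x5e};

        // Lenient decoding substitutes U+FFFD for every invalid byte.
        String lenient = new String(shiftJisBytes, utf8);
        System.out.println(lenient.indexOf(0xFFFD) >= 0);         // prints "true"

        // Strict decoding reports the problem instead.
        System.out.println(decodesCleanly(shiftJisBytes, utf8));  // prints "false"
    }
}
```

A strict decode succeeding does not prove the guess was right, since many byte sequences are valid in more than one encoding, but a failure does prove it was wrong.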
Further, remember that if you print a String to an output such as System.out, the String is converted to bytes using the system default encoding, which varies from system to system. It looks like your system default is UTF-8.
System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));
Then, if your output is, for example, the Windows console, the console will convert those bytes back to characters using what is very probably a completely different encoding (probably CP437 or CP850) before presenting them to you.
This last part might be tripping you up.
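If the default encoding is the suspect, one option is to state the output encoding explicitly rather than rely on the default. The following is a sketch under assumptions: ExplicitOutputEncoding and printedBytes are hypothetical names, and the Shift_JIS line assumes that charset is available in the JDK being used.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOutputEncoding {
    // Prints the text through a PrintStream that encodes with the named charset
    // and returns the raw bytes that reached the underlying stream.
    static byte[] printedBytes(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws Exception {
        String text = new String(Character.toChars(0x30D4)); // katakana PI

        // The same String becomes different bytes depending on the stream's encoding.
        System.out.println(printedBytes(text, "UTF-8").length);     // prints "3"
        System.out.println(printedBytes(text, "Shift_JIS").length); // prints "2"

        // Wrapping System.out fixes the encoding of everything printed through the wrapper.
        // Whether the terminal then displays it correctly is up to the terminal.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(text);
    }
}
```

This only controls the bytes Java writes. The console or terminal still has to be configured to interpret those bytes with the same encoding.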
Answered by Chinbat G.
Using "MS932" instead of Shift-JIS/SJIS may work.
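The two charsets are not identical: MS932 (Microsoft's code page 932, canonical Java name windows-31j) covers vendor extension characters that plain Shift_JIS does not. A small probe like the one below can show which charset accepts a given character. Ms932VsShiftJis and canEncode are illustrative names, and the sample characters are my own choice rather than anything from the question.

```java
import java.nio.charset.Charset;

public class Ms932VsShiftJis {
    // Reports whether the named charset has a byte sequence for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // U+30D4 katakana PI, U+2460 circled digit one, U+FF5E fullwidth tilde.
        char[] samples = {0x30D4, 0x2460, 0xFF5E};
        for (char c : samples) {
            System.out.println(String.format("U+%04X  Shift_JIS=%b  MS932=%b",
                    (int) c, canEncode("Shift_JIS", c), canEncode("MS932", c)));
        }
    }
}
```

If the data comes from or goes to Windows software, characters that only MS932 can encode are a common reason for unexpected '?' bytes when Shift_JIS is used as the target.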