Java string encoding conversion: UTF-8 to SHIFT-JIS

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/37155417/

Time: 2020-11-03 02:15:06  Source: igfitidea

String encoding conversion UTF-8 to SHIFT-JIS

Tags: java, string, encoding, utf-8, character-encoding

Asked by David B

Variables used:

  • JavaSE-6
  • No frameworks


Given this string input of ピーター・ジョーズ, which is encoded in UTF-8, I am having problems converting it to Shift-JIS without writing the data to a file.

  • Input (UTF-8 encoding): ピーター・ジョーンズ
  • Output (SHIFT-JIS encoding): ピーター・ジョーンズ (to be encoded in SHIFT-JIS)


I've tried these code snippets for converting UTF-8 strings to SHIFT-JIS:

  • stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
  • new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: ?s?[?^?[?E?W???[???Y (SHIFT-JIS encoded)

Any ideas on how this can be resolved?

Answered by Christoffer Hammarström

Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or a "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding or Charset is a pair of two functions, "encode" and "decode" such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// String now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)
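
The steps above (UTF-8 bytes, decoded to a String, then encoded as Shift-JIS) can be put together in one small program. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the runtime provides the Shift_JIS charset, which may be missing from a stripped-down JRE. It uses Charset.forName rather than newer helpers so that it stays within Java SE 6.

```java
import java.nio.charset.Charset;

public class Utf8ToShiftJis {
    // Hypothetical constants for this sketch; both names are standard Java charset names.
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes UTF-8 bytes to a String, then encodes that String as Shift-JIS bytes. */
    static byte[] utf8ToShiftJis(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, UTF_8); // bytes -> decode -> String
        return decoded.getBytes(SHIFT_JIS);            // String -> encode -> bytes
    }

    /** Formats bytes as space-separated lowercase hex, for inspection only. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes for the nine-character sample string, so the result
        // does not depend on the encoding of this source file.
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        byte[] utf8Bytes = name.getBytes(UTF_8);
        byte[] shiftJisBytes = utf8ToShiftJis(utf8Bytes);
        // prints: 83 73 81 5b 83 5e 81 5b 81 45 83 57 83 87 81 5b 83 59
        System.out.println(toHex(shiftJisBytes));
    }
}
```

Printing the bytes as hex, rather than as text, keeps the console's own encoding out of the picture while checking the result.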

Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "?s?[?^?[?E?W???[?Y"
// ? stands for U+FFFD, the replacement character decoded from an invalid UTF-8 sequence
// 83 73 81 5b 83 5e   (? s ? [ ? ^)
// 81 5b 81 45 83 57   (? [ ? E ? W)
// 83 87 81 5b 83 59   (? ? ? [ ? Y)
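
If silent garbage is a concern, one option is to decode strictly so that a wrong charset is reported instead of being papered over with replacement characters. The sketch below is an illustration, not part of the original answer: the class name and the decodeOrNull helper are hypothetical, while CharsetDecoder and CodingErrorAction.REPORT are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    /** Returns the decoded String, or null if the bytes are not valid in the given charset. */
    static String decodeOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Shift-JIS bytes for the first three characters of the sample string.
        byte[] shiftJisBytes = {
            (byte) 0x83, 0x73, (byte) 0x81, 0x5b, (byte) 0x83, 0x5e
        };
        // Lenient decoding substitutes U+FFFD for each invalid byte.
        String lenient = new String(shiftJisBytes, Charset.forName("UTF-8"));
        System.out.println(lenient.indexOf('\uFFFD') >= 0);  // prints: true
        // Strict decoding rejects the same bytes when the charset is wrong...
        System.out.println(decodeOrNull(shiftJisBytes, Charset.forName("UTF-8")) == null);  // prints: true
        // ...and accepts them when it is right (assumes Shift_JIS is available).
        System.out.println(decodeOrNull(shiftJisBytes, Charset.forName("Shift_JIS")) != null);  // prints: true
    }
}
```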

Further, remember that if you print a String to an output, for example System.out, the String is converted to bytes using the platform default encoding, which is system dependent. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

Then, if your output is for example the Windows console, the console will convert those bytes back to text using, very probably, a completely different encoding (probably CP437 or CP850) before presenting it to you.
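
One way to take the platform default out of the equation is to wrap the output stream in a PrintStream with an explicitly named charset. The following is a sketch under assumptions: the class name and the printStreamFor helper are invented for illustration, and whether a real console then displays the text correctly still depends on the console's own code page, which Java does not control.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOutputEncoding {
    /** Wraps a stream so that printed Strings are encoded with the named charset. */
    static PrintStream printStreamFor(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";

        // Capture the output in memory to see exactly which bytes are written.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream shiftJisOut = printStreamFor(sink, "Shift_JIS");
        shiftJisOut.print(name);
        shiftJisOut.flush();
        System.out.println(sink.size());  // prints: 18 (two bytes per character)

        // The same wrapper can be applied to System.out, e.g.
        // printStreamFor(System.out, "Shift_JIS").println(name);
    }
}
```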

This last part might be tripping you up.

Answered by Chinbat G.

Using "MS932" instead of Shift-JIS/SJIS may work.

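
The reason MS932 can help is that it is Microsoft's extension of Shift-JIS and covers characters that Java's stricter Shift_JIS charset does not. The sketch below probes one such character; the class and helper names are hypothetical, and it assumes both charsets are installed in the runtime (Charset.forName throws UnsupportedCharsetException otherwise). The circled digit one (U+2460) is used as the sample because it is one of the vendor extension characters.

```java
import java.nio.charset.Charset;

public class Ms932VersusShiftJis {
    /** Reports whether the named charset has an encoding for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char circledOne = '\u2460';  // circled digit one, a vendor extension character

        System.out.println("Shift_JIS: " + canEncode("Shift_JIS", circledOne));
        System.out.println("MS932:     " + canEncode("MS932", circledOne));

        // A character that cannot be encoded is replaced by '?' (0x3f) when
        // String.getBytes is used, which is another way data turns into question marks.
        byte[] bytes = String.valueOf(circledOne).getBytes(Charset.forName("Shift_JIS"));
        System.out.println(String.format("%02x", bytes[0] & 0xff));
    }
}
```

If the data comes from or goes to Windows systems, checking the characters that matter with canEncode before choosing a charset is a cheap sanity test.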