Java string encoding conversion: UTF-8 to Shift-JIS

Question

Asked by David B

Variables used:

JavaSE-6
No frameworks

Given the string input ピーター・ジョーズ, which is encoded in UTF-8, I am having problems converting it to Shift-JIS without writing the data to a file.

Input (UTF-8 encoding): ピーター・ジョーンズ
Output (Shift-JIS encoding): ピーター・ジョーンズ (to be encoded as Shift-JIS)

I've tried these code snippets for converting UTF-8 strings to Shift-JIS:

stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: ?s?[?^?[?E?W???[???Y (Shift-JIS encoded)

Any ideas on how this can be resolved?

Answer 1

Answered by Christoffer Hammarström

Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).
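To make the code unit versus code point distinction concrete, here is a minimal sketch. The class and method names are made up for illustration, and the string literals use \u escapes so the result does not depend on the source file's encoding.

```java
class CodePointsVsCodeUnits {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the whole string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String katakana = "\u30d4\u30fc\u30bf\u30fc"; // ピーター, all inside the Basic Multilingual Plane
        String supplementary = "\ud842\udfb7";        // U+20BB7, stored as a surrogate pair

        System.out.println(codeUnits(katakana) + " units, " + codePoints(katakana) + " code points");           // 4 units, 4 code points
        System.out.println(codeUnits(supplementary) + " units, " + codePoints(supplementary) + " code points"); // 2 units, 1 code points
    }
}
```

For the katakana in the question the two counts agree, so the difference only shows up with characters outside the Basic Multilingual Plane.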

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or a "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding or Charset is a pair of two functions, "encode" and "decode" such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();
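The fragment above is schematic. As a runnable sketch of the same encode/decode pair, something like the following should work; the class and method names are hypothetical, and Charset.forName is used rather than StandardCharsets only so that it also compiles on Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class EncodeDecodeRoundTrip {
    /** String -> encode -> bytes -> decode -> String, via String methods. */
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    /** The same round trip, via Charset.encode and Charset.decode. */
    static String roundTripWithBuffers(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        return charset.decode(buffer).toString();
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ written with escapes, so the source file encoding does not matter
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";

        Charset utf8 = Charset.forName("UTF-8");
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Both charsets can represent these characters, so each round trip is lossless.
        System.out.println(roundTrip(name, utf8).equals(name));                // true
        System.out.println(roundTripWithBuffers(name, shiftJis).equals(name)); // true
    }
}
```

The round trip is only lossless when the charset can represent every character in the String; otherwise the encoder substitutes a replacement byte sequence.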

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// string now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)
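Putting the two steps together, an in-memory conversion with no file involved might look like the sketch below. The transcode and toHex helpers are hypothetical names, not part of any library, and the hex dump is only there so the result can be compared with the byte listings above.

```java
import java.nio.charset.Charset;

class Utf8ToShiftJis {
    static final Charset UTF_8 = Charset.forName("UTF-8");
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes the input bytes with one charset, then re-encodes the text with another. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    /** Formats bytes as lowercase hex pairs separated by single spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ written with escapes, so the source file encoding does not matter
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";

        byte[] utf8Bytes = name.getBytes(UTF_8);
        byte[] shiftJisBytes = transcode(utf8Bytes, UTF_8, SHIFT_JIS);

        System.out.println(toHex(utf8Bytes));
        System.out.println(toHex(shiftJisBytes));
        // prints 83 73 81 5b 83 5e 81 5b 81 45 83 57 83 87 81 5b 83 59
    }
}
```

The resulting byte[] is the Shift-JIS data. It should be passed on as bytes (to a socket, a database driver, and so on) rather than wrapped back into a String.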

Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "?s?[?^?[?E?W???[?Y"
// ? stands for U+FFFD, the replacement character the decoder emits for an invalid UTF-8 sequence
// 83 73 81 5b 83 5e   (? s ? [ ? ^)
// 81 5b 81 45 83 57   (? [ ? E ? W)
// 83 87 81 5b 83 59   (? ? ? [ ? Y)
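If silent replacement is not what you want, one option is a strict decoder that reports malformed input instead of substituting U+FFFD. This is a sketch under that assumption; decodeStrict and isValid are hypothetical helper names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    static final Charset UTF_8 = Charset.forName("UTF-8");
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes bytes, throwing instead of inserting replacement characters. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Returns true if the bytes decode cleanly with the given charset. */
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ written with escapes, so the source file encoding does not matter
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        byte[] shiftJisBytes = name.getBytes(SHIFT_JIS);

        // The lenient String constructor hides the problem behind U+FFFD...
        String garbage = new String(shiftJisBytes, UTF_8);
        System.out.println(garbage.indexOf('\ufffd') >= 0); // true

        // ...while the strict decoder makes the charset mismatch visible.
        System.out.println(isValid(shiftJisBytes, UTF_8));     // false
        System.out.println(isValid(shiftJisBytes, SHIFT_JIS)); // true
    }
}
```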

Further, remember that if you print a String to an output, for example System.out, the String is converted to bytes using the default encoding, which is system dependent. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

Then if your output is, for example, the Windows console, it will convert those bytes back to text using very probably a completely different encoding (probably CP437 or CP850) before presenting it to you.

This last part might be tripping you up.
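One way to see the effect of the output encoding without involving a console at all is to print into a byte buffer through a PrintStream created with an explicit encoding. The sketch below assumes that setup; printedBytes is a hypothetical helper.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ExplicitOutputEncoding {
    /** Returns the bytes a PrintStream with the given encoding writes for the text. */
    static byte[] printedBytes(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encodingName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encodingName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ written with escapes, so the source file encoding does not matter
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";

        // What System.out would use on this machine; the value is system dependent.
        System.out.println("default charset: " + Charset.defaultCharset());

        // The same String becomes different bytes depending on the stream's encoding.
        System.out.println(printedBytes(name, "UTF-8").length);     // 27
        System.out.println(printedBytes(name, "Shift_JIS").length); // 18
    }
}
```

The same constructor can wrap System.out, for example new PrintStream(System.out, true, "Shift_JIS"), but what you then see still depends on which encoding the terminal uses to interpret those bytes.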

Answer 2

Answered by Chinbat G.

Using "MS932" instead of Shift-JIS/SJIS may work.
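Whether this matters depends on the characters involved. As far as I know, MS932 is the Windows variant of Shift-JIS (also known as windows-31j) and covers vendor extensions such as circled digits. Here is a rough sketch for checking a given string on your own JDK; it assumes both charsets are installed there, and the helper names are made up.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CompareShiftJisVariants {
    /** Returns true if both charsets encode the text to identical bytes. */
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        byte[] a = text.getBytes(Charset.forName(charsetA));
        byte[] b = text.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    /** Returns true if the named charset has a mapping for every character in the text. */
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ written with escapes, so the source file encoding does not matter
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        String circledOne = "\u2460"; // a circled digit one, a typical vendor-extension character

        // For the katakana in the question, both charsets give the same bytes.
        System.out.println(sameBytes(name, "Shift_JIS", "MS932")); // true

        // For extension characters, the answer may differ between the two charsets.
        System.out.println("Shift_JIS: " + canEncode(circledOne, "Shift_JIS"));
        System.out.println("MS932:     " + canEncode(circledOne, "MS932"));
    }
}
```

For the string in the question the two charsets produce the same bytes, so switching to MS932 would not by itself change the output shown above.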
