Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/7698794/

Time: 2020-10-30 21:02:35. Source: igfitidea

Japanese Character Encoding in Java

java unicode cjk

Asked by Allan Jiang

Here's my problem. I'm using Java with Apache POI to read an Excel (.xls or .xlsx) file and display its contents. There are some Japanese characters in the spreadsheet, and all of them show up as "???" in my output. I tried Shift-JIS, UTF-8 and many other encodings, but none of them work. Here's my encoding code:

public String encoding(String str) throws UnsupportedEncodingException{
  String Encoding = "Shift_JIS";
  return this.changeCharset(str, Encoding);
}
public String changeCharset(String str, String newCharset) throws UnsupportedEncodingException {
  if (str != null) {
    byte[] bs = str.getBytes();
    return new String(bs, newCharset);
  }
  return null;
}

I am passing every string I get to encoding(str). But when I print the return value, it is still something like "???" (as shown below) rather than Japanese characters (hiragana, katakana or kanji).

title-jp=???

Can anyone help me with this? Thank you so much.

Accepted answer by Daniel Earwicker

Your changeCharset method seems strange. String objects in Java are best thought of as not having a specific character set. They use Unicode and so can represent all characters, not only one regional subset. Your method says: turn the string into bytes using my system's character set (whatever that may be), and then try to interpret those bytes using some other character set (specified in newCharset), which therefore probably won't work. If you convert to bytes in an encoding, you should read those bytes with the same encoding.
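To make the point concrete, here is a minimal sketch (not the asker's code) contrasting a matched encode/decode with a mismatched one. The class and method names are made up for illustration, and US-ASCII stands in for a system default charset that cannot represent Japanese. The sample text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Encode and decode with the SAME charset: the text survives intact.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // Encode with one charset and decode with another, which is in effect
    // what changeCharset does when the default charset differs from newCharset.
    static String mismatched(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // the three kanji of "nihongo"
        // Assumes the runtime ships the Shift_JIS charset, as standard JDKs do.
        Charset sjis = Charset.forName("Shift_JIS");

        System.out.println(roundTrip(text, sjis).equals(text));
        // US-ASCII cannot encode kanji, so getBytes substitutes '?' for each one.
        System.out.println(mismatched(text, StandardCharsets.US_ASCII, sjis));
    }
}
```

If the first getBytes call cannot represent a character, the information is already gone before any second charset is applied, which would be consistent with the "???" in the question.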

Update:

To convert a String to Shift-JIS (a regional encoding commonly used in Japan) you can say:

byte[] jis = str.getBytes("Shift_JIS");

If you write those bytes into a file, and then open the file in Notepad on a Windows computer where the regional settings are all Japan-centric, Notepad will display it in Japanese (having nothing else to go on, it will assume the text is in the system's local encoding).

However, you could equally well save it as UTF-8 (prefixed with the 3-byte UTF-8 introducer sequence, i.e. the byte order mark EF BB BF) and Notepad will also display it as Japanese. Shift-JIS is only one way of representing Japanese text as bytes.
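As a rough sketch of the two options described above, the following writes the same text once as Shift-JIS and once as UTF-8 with the 3-byte prefix. The class name, method names and output file names are hypothetical, and the Shift_JIS charset is assumed to be available in the runtime.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class JapaneseFileBytes {
    // The UTF-8 byte order mark: EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Text encoded as Shift-JIS bytes.
    static byte[] asShiftJis(String text) {
        return text.getBytes(Charset.forName("Shift_JIS"));
    }

    // Text encoded as UTF-8 bytes, prefixed with the byte order mark.
    static byte[] asUtf8WithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "\u65E5\u672C\u8A9E"; // the three kanji of "nihongo"
        // Hypothetical file names; either file holds the same text as different bytes.
        Files.write(Paths.get("title-sjis.txt"), asShiftJis(text));
        Files.write(Paths.get("title-utf8.txt"), asUtf8WithBom(text));
    }
}
```

The same String produces different byte sequences in each file; which one a viewer displays correctly depends on the encoding it assumes or detects.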

Answer by Jon Skeet

I suspect you shouldn't be doing this in the first place. If it really is Apache POI's fault, then you'll need to get the original raw bytes from the data, not just use the system default encoding.

On the other hand, I think it's entirely likely that Apache POI has managed to do the right thing, and it's just an output problem. I suggest you dump the original string you've got (removing your encoding method entirely) in terms of its Unicode code points, e.g.

 for (int i = 0; i < text.length(); i++) {
     System.out.println("U+" + Integer.toHexString(text.charAt(i)));
 }

Then check those Unicode values against the ones at the Unicode web site.
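One way to run that check and then test the output side is sketched below. The class and method names are invented for this example, the sample text is illustrative, and whether the Japanese text renders correctly still depends on the console or file viewer actually interpreting the bytes as UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class CodePointDump {
    // Formats each code point as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u30BF\u30A4\u30C8\u30EB"; // katakana for "title"
        System.out.println(describe(text));

        // If the code points look right, the string itself is fine and only the
        // output encoding needs attention: wrap System.out with an explicit charset.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

If the dump shows values in the U+3040 to U+30FF range (hiragana and katakana) or CJK ideographs, the string is intact and the problem lies in how it is printed. If it shows U+003F, the characters were already replaced by question marks before reaching this point.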
