Notice: this page is a Chinese/English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/24009119/

Date: 2020-08-14 09:41:13  Source: igfitidea

UTF-8 Encoding; Only some Japanese characters are not getting converted

Tags: java, encoding, utf-8, character-encoding, utf

Asked by Janak

I am receiving a parameter value from a Jersey Web Service, and the value consists of Japanese characters.

Here, 'japaneseString' is the web service parameter containing the Japanese characters.

   String name = new String(japaneseString.getBytes(), "UTF-8");

However, I am able to convert a few string literals successfully, while others cause problems.

The following were successfully converted:

 1) アップル
 2) 赤
 3) 世丕且且世两上与丑万丣丕且丗丕
 4) 世世丗丈

While these didn't:

 1) ひほわれよう
 2) 存在する

When I investigated further, I found that these two strings are being converted into junk characters.

 1) Input: ひほわれよう        Output: ????????れよ???
 2) Input: 存在する            Output: 存在???る

Any idea why some of the Japanese characters are not converted properly?

Thanks.

Accepted answer by Nitul

Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM), e.g.: -Dfile.encoding=UTF-8
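
To check whether the flag took effect, one can print the charset the JVM falls back to when no charset argument is given. This is only a minimal sketch; the class name DefaultCharsetCheck and the helper defaultCharsetName are made up for illustration and are not part of the original answer.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Name of the charset used by calls such as String.getBytes()
    // when no charset argument is passed.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // May be null or differ from the line above on some JVM versions.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 should show whether the Tomcat JVM picks up the setting.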

Answer by fge

You are mixing concepts here.

A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace "characters" in the above with "carrier pigeons". Same thing. A carrier pigeon has no encoding. Neither does a char. (1)

What you are doing here:

new String(x.getBytes(), "UTF-8")

is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.

If you don't, and that is what happens here, you get the result of the encoding process using your default character set; you then try to re-decode this byte sequence using UTF-8.

Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you are using a Reader with the wrong charset. Fix that part.
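
As a rough illustration of "fix the Reader", the sketch below decodes the same UTF-8 bytes twice, once with the right charset and once with a wrong one. The class ReadWithCharset and its readAll helper are hypothetical names, and the in-memory byte array merely stands in for whatever stream the web service actually reads from. The Japanese text is written with Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Decodes the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u5B58\u5728\u3059\u308B"; // the question's second failing string
        // Stand-in for bytes arriving on the wire, encoded as UTF-8.
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 reader matches:      " + right.equals(text));
        System.out.println("ISO-8859-1 reader matches: " + wrong.equals(text));
    }
}
```

The point is that the charset is chosen once, where bytes become characters; nothing needs to be converted afterwards.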

For more information, read this link.

(1) The fact that a char is actually a UTF-16 code unit is irrelevant to this discussion.

Answer by Joop Eggen

I concur with @fge.

Clarification

In Java, String/char/Reader/Writer handle (Unicode) text and can combine all scripts in the world.

byte[]/InputStream/OutputStream, on the other hand, hold binary data, which needs an indication of some encoding to be converted to a String.

In your case japaneseString should already be a correct String, or be replaced by the original byte[].

Traps in Java

The encoding is often an optional parameter, which then defaults to the platform encoding. You fell into that trap too:

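import java.nio.charset.StandardCharsets;

String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit, but throws the checked UnsupportedEncodingException
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit and better: no checked exception
                          // (constants exist for UTF-8, ISO-8859-1, ...)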

In general, avoid the overloaded methods without an encoding parameter, as they are meant for data that stays on the current computer: non-portable. For completeness: the classes FileReader/FileWriter should be avoided where they offer no encoding parameter (charset-taking constructors were only added in Java 11).
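
One possible alternative to FileReader/FileWriter is the java.nio.file.Files API, which accepts a charset. The following is a sketch only: the class ExplicitFileEncoding, the method writeThenRead and the temporary file prefix are invented for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitFileEncoding {
    // Writes one line of text to a temporary file as UTF-8,
    // then reads it back with the same explicit charset.
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return r.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u3072\u307B\u308F\u308C\u3088\u3046"; // the question's first failing string
        System.out.println("Round trip intact: " + text.equals(writeThenRead(text)));
    }
}
```

Because both ends name UTF-8 explicitly, the result does not depend on the platform encoding of the machine running it.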

Error

japaneseString is already wrong, so you have to read it correctly in the first place. It could have been erroneously read as Windows-1252 (Windows Latin-1) and then been damaged when recoded to UTF-8. Evidently only some cases get messed up.
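
Why only some strings break can be reproduced under the assumption (not confirmed by the question) that the UTF-8 bytes were decoded as windows-1252 somewhere upstream. Windows-1252 leaves a few byte values unassigned, among them 0x81 and 0x8F, and Java substitutes a replacement character for such bytes, so the original byte is lost. The UTF-8 form of す (U+3059) is E3 81 99 and therefore contains such a byte, while 赤 (U+8D64, bytes E8 B5 A4) does not. The class and method names below are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Windows1252RoundTrip {
    // Assumed charset of the faulty read; the question does not confirm it.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Simulates the suspected bug: UTF-8 bytes are wrongly decoded as
    // windows-1252, then "repaired" the way the question's code does it.
    static String misreadThenRepair(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, WINDOWS_1252);      // the wrong read
        byte[] recovered = misread.getBytes(WINDOWS_1252);    // like getBytes() with that default
        return new String(recovered, StandardCharsets.UTF_8); // like new String(..., "UTF-8")
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u8D64",                               // worked in the question
            "\u30A2\u30C3\u30D7\u30EB",             // worked in the question
            "\u5B58\u5728\u3059\u308B",             // failed in the question
            "\u3072\u307B\u308F\u308C\u3088\u3046"  // failed in the question
        };
        for (String s : samples) {
            System.out.println(s + " survives: " + s.equals(misreadThenRepair(s)));
        }
    }
}
```

If this assumption holds, the repair line only appears to work for strings whose bytes all happen to be assigned in windows-1252, which is why reading the input with the right charset is the real fix.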

Maybe you had:

String japaneseString = new String(bytes);

instead of:

String japaneseString = new String(bytes, StandardCharsets.UTF_8);

At the end:

String name = japaneseString;

Show the code for reading japaneseString for further help.