Notice: This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/229015/

Time: 2020-08-11 11:45:47  Source: igfitidea

Encoding conversion in Java

Tags: java, character-encoding, converters

Asked by tropikalista

Is there any free Java library I can use to convert a string from one encoding to another, something like iconv? I'm using Java version 1.3.

Accepted answer by Jon Skeet

You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)

EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
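
As a minimal sketch of the two String methods mentioned above, the following re-encodes ISO-8859-1 bytes as UTF-8. The class name TranscodeSketch and the method transcode are hypothetical names chosen for illustration, and the code assumes Java 7 or later (for StandardCharsets), which is newer than the Java 1.3 mentioned in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    // Re-encode bytes from one charset to another by way of a String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // decode: bytes -> chars
        return text.getBytes(to);              // encode: chars -> bytes
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute (0xE9) in ISO-8859-1
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The accented letter needs two bytes in UTF-8, so the array grows from 4 to 5 bytes.
        System.out.println(latin1.length + " -> " + utf8.length);
    }
}
```

Passing Charset constants rather than charset names avoids the typo risk the answer describes, and these overloads do not throw UnsupportedEncodingException.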

See "URL Encoding (or: 'What are those "%20" codes in URLs?')".

Answer by VonC

CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF-16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).

See Charset. That doesn't mean UTF-16 is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
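
To see which charset a given JVM picked as its default, something like the following sketch may help. The class DefaultCharsetSketch and the helper isStandard are hypothetical names, the code assumes Java 7 or later, and the printed result depends on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetSketch {
    // The six standard charsets listed in the quoted documentation.
    static final Charset[] STANDARD = {
        StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8,
        StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16
    };

    // Returns true if cs is one of the six standard charsets.
    static boolean isStandard(Charset cs) {
        for (Charset s : STANDARD) {
            if (s.equals(cs)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("default charset: " + def.name() + ", standard: " + isStandard(def));
    }
}
```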

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

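import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

try {
    // Convert a string to ISO-8859-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

    // Convert ISO-8859-1 bytes in a ByteBuffer to a CharBuffer and then to a string.
    // The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown when the input is malformed or contains a character ISO-8859-1 cannot represent.
    e.printStackTrace();
}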
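
The catch block above matters because an encoder reports characters it cannot map. A small sketch of using that behavior to check whether a string fits in ISO-8859-1 could look like this; StrictEncodeSketch and fitsLatin1 are hypothetical names, and the code assumes Java 7 or later.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictEncodeSketch {
    // Returns true if every character of text can be represented in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // Raised for unmappable characters or malformed surrogate input.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00e9")); // e-acute exists in ISO-8859-1
        System.out.println(fitsLatin1("\u20ac"));    // the euro sign does not
    }
}
```

By contrast, String.getBytes silently substitutes a replacement byte for unmappable characters, so the explicit encoder is the safer choice when data loss must be detected.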

Answer by Jon Skeet

It is a whole lot easier if you think of Unicode as a character set (which it actually is: basically the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character, depending on the character) or UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).
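
A quick way to see these sizes is to encode a few sample characters and count the bytes. This is only a sketch; EncodedLengthSketch and its two helpers are hypothetical names, and the code assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {
    // Number of bytes needed to encode s as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode s as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9 (e-acute), U+20AC (euro sign), U+1F600 (an emoji, written as a surrogate pair)
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes in UTF-8, " + utf16Length(s) + " bytes in UTF-16");
        }
        // UTF-8 sizes are 1, 2, 3, 4; UTF-16 sizes are 2, 2, 2, 4.
    }
}
```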

Back in the mists of time, Java used UCS-2 to encode the Unicode character set. UCS-2 can only handle 2 bytes per character and is now obsolete. Adding surrogate pairs and moving up to UTF-16 was a fairly obvious hack.

A lot of people think they should have used UTF-8 in the first place. When Java was originally written, Unicode still fit within 65,536 code points, but it has since grown well beyond that limit.

Answer by brijesh k

UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
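
The byte order mark check described above can be sketched as follows. BomSketch and detectBom are hypothetical names; the sketch only looks for the UTF-8 and UTF-16 marks and, as the answer warns, a missing mark proves nothing about the encoding.

```java
final class BomSketch {
    // Returns the charset name implied by a leading byte order mark,
    // or null if the data does not start with one of the marks checked here.
    // Note: the UTF-32LE mark (FF FE 00 00) also starts with FF FE and is not distinguished.
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(detectBom(withBom));    // UTF-8
        System.out.println(detectBom(withoutBom)); // null
    }
}
```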

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.

There's no idea of reading a file "as it is" and yet getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
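
To illustrate that the same bytes only become text once a charset is applied, here is a small sketch that decodes one byte sequence two ways. DecodeChoiceSketch and decode are hypothetical names, and the code assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeChoiceSketch {
    // Decode the given bytes into text using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // These two bytes are the UTF-8 encoding of U+00E9 (e-acute).
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);        // one character: U+00E9
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1); // two characters: U+00C3 U+00A9
        System.out.println(asUtf8.length() + " char(s) as UTF-8, " + asLatin1.length() + " char(s) as ISO-8859-1");
    }
}
```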

Source: Stack Overflow

Answer by wallabui

I would just like to add that if the String was originally decoded using the wrong encoding, it might be impossible to convert it to another encoding without errors. The question does not state that the conversion here is from a wrong encoding to the correct one, but I personally stumbled upon this question because of exactly that situation, so this is a heads-up for others as well.

This answer to another question explains why the conversion does not always yield correct results: https://stackoverflow.com/a/2623793/4702806
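
A minimal sketch of why a wrong decode can be unrecoverable: decoding ISO-8859-1 bytes as UTF-8 replaces invalid sequences with U+FFFD, so the original bytes are gone, whereas decoding as ISO-8859-1 maps every byte to a character and can be undone. WrongDecodeSketch and roundTrip are hypothetical names, and the code assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class WrongDecodeSketch {
    // Decode bytes with the given charset, then encode the resulting text with the same charset.
    static byte[] roundTrip(byte[] original, Charset charset) {
        String text = new String(original, charset);
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute (0xE9) in ISO-8859-1; 0xE9 alone is not valid UTF-8.
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};

        // Wrongly treating the bytes as UTF-8 loses the last byte for good.
        byte[] viaUtf8 = roundTrip(latin1, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(latin1, viaUtf8)); // false

        // ISO-8859-1 assigns a character to every byte value, so this round trip is lossless.
        byte[] viaLatin1 = roundTrip(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(latin1, viaLatin1)); // true
    }
}
```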
