Original question: http://stackoverflow.com/questions/229015/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Encoding conversion in Java
Asked by tropikalista
Accepted answer by Jon Skeet
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)
EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
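As a minimal sketch of the two String methods mentioned above, the following re-encodes ISO-8859-1 bytes as UTF-8. The class name TranscodeExample and the helper transcode are hypothetical, and the StandardCharsets constants assume Java 7 or later; on older versions Charset.forName("UTF-8") would be the equivalent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeExample {
    // Re-encode bytes from one charset to another by decoding to a String first.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> characters
        return text.getBytes(to);              // characters -> bytes
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // In UTF-8 the same character takes two bytes.
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
    }
}
```

Using Charset constants rather than encoding names also removes the checked UnsupportedEncodingException that the name-based overloads declare.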
See "URL Encoding (or: 'What are those "%20" codes in URLs?')".
Answer by VonC
CharsetDecoder should be what you are looking for, no?
Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF-16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).
See Charset. That doesn't mean UTF-16 is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
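To see what the quoted documentation means on a given machine, a small sketch like the one below can print the JVM's default charset and check that the six standard charsets are present. The class name DefaultCharsetExample and its helper methods are hypothetical; the printed default depends on the platform and JVM settings, so no particular output is implied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetExample {
    static final String[] STANDARD = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // True if every standard charset named in the Charset documentation is available.
    static boolean standardCharsetsSupported() {
        for (String name : STANDARD) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    // Encoding with an explicit charset does not depend on the JVM default.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Standard charsets supported: " + standardCharsetsSupported());
        System.out.println("UTF-8 bytes for U+00E9: " + utf8Length("\u00E9"));
    }
}
```

Because the default can differ between machines, passing a charset explicitly to getBytes and the String constructor keeps the result the same everywhere.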
This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.
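import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
    // Convert a string to ISO-LATIN-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));
    // Convert ISO-LATIN-1 bytes in a ByteBuffer to a CharBuffer and then to a string.
    // The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown when the input cannot be encoded or decoded in this charset;
    // handle or rethrow it rather than ignoring it.
}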
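One reason to reach for CharsetEncoder instead of String.getBytes is that it can report characters the target charset cannot represent. The sketch below illustrates that difference; the class name StrictEncodingExample and the helper encodeStrict are hypothetical, and returning null on failure is just one possible convention.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncodingExample {
    // Returns the ISO-8859-1 bytes, or null if some character has no mapping.
    static byte[] encodeStrict(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            return null; // for example an UnmappableCharacterException
        }
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign is not part of ISO-8859-1
        System.out.println("abc encodable: " + (encodeStrict("abc") != null));
        System.out.println("euro encodable: " + (encodeStrict(euro) != null));
        // String.getBytes substitutes the charset's replacement byte instead of failing.
        byte[] replaced = euro.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("getBytes produced " + replaced.length + " byte(s)");
    }
}
```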
Answer by Jon Skeet
It is a whole lot easier if you think of Unicode as a character set (which it actually is: basically the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character, depending on the code point) or UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).
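The difference between the character set and its encodings can be made concrete by counting bytes. This is a small sketch with a hypothetical class name (EncodedLengthExample) and helpers; it uses UTF-16BE rather than UTF-16 so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthExample {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII range
            "\u00E9",       // U+00E9, Latin-1 range
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, stored in a String as a surrogate pair
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d bytes, UTF-16=%d bytes, Java chars=%d%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s), s.length());
        }
    }
}
```

The last sample is a single code point but two Java chars, which is why String.length() and codePointCount() can disagree.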
Back in the mists of time, Java used UCS-2 to encode the Unicode character set. UCS-2 could only handle 2 bytes per character and is now obsolete. Adding surrogate pairs and moving up to UTF-16 was a fairly obvious hack.
A lot of people think they should have used UTF-8 in the first place. When Java was originally written, Unicode still had fewer than 65,536 characters, so 16 bits per character looked sufficient...
Answer by brijesh k
UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
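A byte order mark check along these lines might look like the sketch below. The class name BomSniffer and the method detectBom are hypothetical, and the check is deliberately simple: it does not distinguish the UTF-32LE mark (FF FE 00 00) from the UTF-16LE one, and a missing mark proves nothing about the encoding.

```java
class BomSniffer {
    // Returns the charset name suggested by a leading byte order mark, or null if none is present.
    static String detectBom(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // could also be the start of a UTF-32LE mark
        }
        return null; // no mark: the encoding has to be guessed or known some other way
    }

    public static void main(String[] args) {
        byte[] withMark = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutMark = {'h', 'i'};
        System.out.println(detectBom(withMark));
        System.out.println(detectBom(withoutMark));
    }
}
```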
I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
There's no such thing as reading a file "as it is" and yet getting text out: a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
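To illustrate that the same bytes only become text once a charset is chosen, the sketch below writes two bytes to a temporary file and decodes them twice. The class FileDecodeExample and its helpers readText and writeTemp are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileDecodeExample {
    // Reads the raw bytes and decodes them with the charset the caller names.
    static String readText(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the given bytes to a temporary file and returns its path.
    static Path writeTemp(byte[] bytes) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            Files.write(file, bytes);
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The bytes C3 A9 are one character (U+00E9) in UTF-8 but two in ISO-8859-1.
        Path file = writeTemp(new byte[] {(byte) 0xC3, (byte) 0xA9});
        System.out.println(readText(file, StandardCharsets.UTF_8).length());
        System.out.println(readText(file, StandardCharsets.ISO_8859_1).length());
    }
}
```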
Source: Stack Overflow
Answer by wallabui
I would just like to add that if the String was originally decoded using the wrong encoding, it might be impossible to convert it to another encoding without errors. The question does not state that the conversion here is from a wrong encoding to the correct one, but I personally stumbled upon this question because of exactly that situation, so this is a heads-up for others as well.
This answer to another question explains why the conversion does not always yield correct results: https://stackoverflow.com/a/2623793/4702806
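The point about wrongly decoded strings can be reproduced in a few lines. This is a sketch with a hypothetical class (MojibakeExample) and helper (repair); it assumes the original bytes were UTF-8 and compares two wrong decodings, one that happens to be reversible and one that is not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeExample {
    // Tries to undo a wrong decode: re-encode with the wrong charset, then decode with the right one.
    static String repair(String garbled, Charset wrong, Charset right) {
        return new String(garbled.getBytes(wrong), right);
    }

    public static void main(String[] args) {
        String original = "\u00E9";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // C3 A9

        // Wrongly decoded as ISO-8859-1: every byte maps to some character, so nothing is lost.
        String latin1Garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(
                repair(latin1Garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));

        // Wrongly decoded as US-ASCII: bytes above 0x7F become U+FFFD and the original bytes are gone.
        String asciiGarbled = new String(utf8, StandardCharsets.US_ASCII);
        System.out.println(original.equals(
                repair(asciiGarbled, StandardCharsets.US_ASCII, StandardCharsets.UTF_8)));
    }
}
```

Whether a repair is possible therefore depends on whether the wrong charset preserved every input byte, which is why fixing the decode at the source is the safer route.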