Java: Readers and Encodings
Note: this page reproduces a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/1888189/
Asked by Martijn Courteaux
Java's default encoding is ASCII. Yes? (See my edit below)
When a text file is encoded in UTF-8, how does a Reader know that it has to use UTF-8?
The Readers I talk about are:
- FileReaders
- BufferedReaders from Sockets
- A Scanner from System.in
- ...
EDIT
It turns out the encoding depends on the OS, which means that the following is not true on every OS:
'a' == 97
Answered by BalusC
How does a Reader know that he has to use UTF-8?
You normally specify that yourself in an InputStreamReader. It has a constructor taking the character encoding. E.g.
Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");
All other readers (as far as I know) use the platform default character encoding, which may indeed not be the correct encoding for your file (such as -cough- CP-1252).
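To make the effect of the charset argument concrete, here is a minimal sketch that decodes the same UTF-8 bytes twice, once with the right charset and once with ISO-8859-1. The class name ExplicitCharsetDemo and its helper methods are made up for illustration; it reads from an in-memory byte array rather than a file so that it does not depend on any path.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ExplicitCharsetDemo {
    // Decodes all bytes through an InputStreamReader that uses the given charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // The word "café" encoded as UTF-8; U+00E9 becomes the two bytes 0xC3 0xA9.
    static byte[] sampleUtf8Bytes() {
        return "caf\u00E9".getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = sampleUtf8Bytes();
        // Same bytes, two different charsets: only the first result is the original text.
        System.out.println(decode(bytes, StandardCharsets.UTF_8));
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

With the wrong charset nothing fails: each byte is simply mapped to some character, which is why a wrong guess often goes unnoticed until someone looks at the text.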
You can in theory also detect the character encoding automatically based on the byte order mark. This distinguishes the several Unicode encodings from other encodings. Java SE unfortunately doesn't have any API for this, but you can homebrew one which can be used to replace the InputStreamReader in the example above:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        // Read ahead up to four bytes and check for BOM marks.
        int n = pushbackStream.read(bom, 0, bom.length);

        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((n >= 4) && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            // The UTF-32LE mark starts with the UTF-16LE mark, so it must be tested first.
            // The n >= 4 guard keeps a 2-byte stream (zero-filled buffer) from matching.
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM found: fall back to the caller's default.
            encoding = defaultEncoding;
            unread = n;
        }

        // Push back the bytes that are not part of the BOM, so the BOM itself is skipped.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
Edit as a reply to your edit:
So the encoding depends on the OS. So that means that this is not true on every OS:
'a' == 97
No, this is not true. The ASCII encoding (which contains 128 characters, 0x00 up to and including 0x7F) is the basis of almost all other character encodings in common use. Only the characters outside the ASCII charset risk being displayed differently in another encoding. The ISO-8859 encodings cover the characters in the ASCII range with the same code points. The Unicode encodings cover the characters in the ISO-8859-1 range with the same code points.
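As a rough illustration of this point, the sketch below prints the bytes that a few charsets produce for an ASCII letter and for a non-ASCII letter. The class AsciiSubsetDemo and its hex helper are hypothetical names; only the three charsets that every Java platform is required to support are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class AsciiSubsetDemo {
    // Returns the encoded bytes of the text as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            // 'a' is in the ASCII range; U+00E9 (e with acute accent) is not.
            System.out.println(cs + ": a = " + hex("a", cs) + ", U+00E9 = " + hex("\u00E9", cs));
        }
    }
}
```

The letter a comes out as the single byte 61 (decimal 97) in all three charsets, while U+00E9 is one byte in ISO-8859-1, two bytes in UTF-8, and cannot be represented in US-ASCII at all, where getBytes substitutes a question mark.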
You may find each of these articles an interesting read:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (more theoretical of the two)
- Unicode - How to get the characters right? (more practical of the two)
Answered by kdgregory
Java's default encoding depends on your OS. For Windows, it's normally "windows-1252", for Unix it's typically "ISO-8859-1" or "UTF-8".
A reader knows the correct encoding because you tell it the correct encoding. Unfortunately, not all readers let you do this (for example, FileReader doesn't), so often you have to use an InputStreamReader.
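To check what default a given JVM actually uses, something like the following sketch may help. The class name DefaultCharsetDemo is made up; Charset.defaultCharset() and InputStreamReader.getEncoding() are standard API. Note that this answer predates two later changes: since Java 11, FileReader also has constructors that take a Charset, and since JDK 18 the default charset is UTF-8 unless it is overridden.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original answer.
class DefaultCharsetDemo {
    // Name of the encoding an InputStreamReader picks when none is specified.
    // getEncoding() may return a historical name (for example "UTF8" or "Cp1252").
    static String readerDefault() {
        try (InputStreamReader r = new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            return r.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset:        " + Charset.defaultCharset());
        System.out.println("InputStreamReader uses: " + readerDefault());
    }
}
```

Running this on different machines, or with a different -Dfile.encoding setting, is a quick way to see why code that relies on the default can behave differently from one environment to the next.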
Answered by Joachim Sauer
I'd like to approach this part first:
Java's default encoding is ASCII. Yes?
There are at least 4 different things in the Java environment that can arguably be called "default encoding":
- the "default charset" is what Java uses to convert bytes to characters (and byte[] to String) at runtime, when nothing else is specified. This one depends on the platform, settings, command line arguments, ... and is usually just the platform default encoding.
- the internal character encoding that Java uses in char values and String objects. This one is always UTF-16! There is no way to change it, it just is UTF-16! This means that a char representing a always has the numeric value 97 and a char representing π always has the numeric value 960.
- the character encoding that Java uses to store String constants in .class files. This one is always UTF-8. There is no way to change it.
- the charset that the Java compiler uses to interpret Java source code in .java files. This one defaults to the default charset, but can be configured at compile time.
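The second item in the list can be checked directly. This is a small sketch with a made-up class name (InternalEncodingDemo); the π character is written as the escape \u03C0 so that the result does not depend on the charset the compiler uses to read the source file, which is the fourth item in the list.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class InternalEncodingDemo {
    // Numeric value of a char, independent of any platform setting.
    static int codeOf(char c) {
        return c;
    }

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('a'));       // 97 on every platform
        System.out.println(codeOf('\u03C0'));  // 960 on every platform (Greek small letter pi)

        // Only when chars are turned into bytes does a charset come into play.
        System.out.println(encodedLength("\u03C0", StandardCharsets.UTF_8));
        System.out.println(encodedLength("\u03C0", StandardCharsets.UTF_16BE));
    }
}
```

The char values never change; what varies is the byte representation, which is two bytes for π in both UTF-8 and UTF-16BE but with different byte values.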
How does a Reader know that he has to use UTF-8?
It doesn't. If you have some plain text file, then you must know the encoding to read it correctly. If you're lucky you can guess (for example, you can try the platform default encoding), but that's an error-prone process and in many cases you wouldn't even have a way to realize that you guessed wrong. This is not specific to Java. It's true for all systems.
Some formats such as XML and all XML-based formats were designed with this restriction in mind and include a way to specify the encoding in the data, so that guessing is no longer necessary.
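As an illustration of a format that carries its own encoding, the sketch below hands raw bytes to the JDK's built-in XML parser without ever naming a charset in Java code. The class XmlEncodingDemo and the sample document are invented for this example; the parser picks up ISO-8859-1 from the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original answer.
class XmlEncodingDemo {
    // A tiny document that declares ISO-8859-1 and is really stored in ISO-8859-1.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00E9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parses raw bytes; the parser reads the encoding from the XML declaration.
    static String parseText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseText(latin1Document()));
    }
}
```

The key design point is that the parser is given an InputStream rather than a Reader: wrapping the stream in a Reader yourself would force a charset choice before the declaration has been seen.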
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for the details.
Answered by BobMcGee
For most readers, Java uses whatever encoding & character set your platform does -- this may be some flavor of ASCII or UTF-8, or something more exotic like JIS (in Japan). Characters in this set are then converted to the UTF-16 which Java uses internally.
There's a work-around if the platform encoding is different from the file encoding (my problem -- UTF-8 files are standard, but my platform uses Windows-1252 encoding). Create an InputStreamReader instance that uses the constructor specifying the encoding.
Edit: do this like so:
InputStreamReader myReader = new InputStreamReader(new FileInputStream(myFile),"UTF-8");
//read data
myReader.close();
However, IIRC there are some provisions to autodetect common encodings (such as UTF-8 and UTF-16). UTF-16 can be detected by the byte order mark at the beginning. UTF-8 follows certain rules too, but generally the difference between your platform encoding and UTF-8 isn't going to matter unless your text contains non-ASCII (international) characters.
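One way to use the "certain rules" of UTF-8 is to decode strictly and see whether it fails. The following is only a sketch under that assumption, with a made-up class name (Utf8Check). It is a heuristic, not real autodetection: it can tell you that bytes are not UTF-8, but passing the check does not prove the file was written as UTF-8, since plain ASCII text passes too.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean looksLikeUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));    // well-formed UTF-8
        System.out.println(looksLikeUtf8(latin1));  // a lone 0xE9 byte is not valid UTF-8
    }
}
```

The default behaviour of InputStreamReader is the opposite: it silently replaces malformed input, so a strict decoder like this is needed if you want to notice the mismatch.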
Answered by Steve De Caux
You can start getting the idea here: Java Charset API
Note that according to the doc,
The native character encoding of the Java programming language is UTF-16
EDIT:
Sorry, I got called away before I could finish this; maybe I shouldn't have posted the partial answer as it was. Anyway, the other answers explain the details, the point being that the native file charset for each platform, together with common alternate charsets, will be read correctly by Java.

