Java 文本文件编码
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1288899/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Java Text File Encoding
提问by user
I have a text file and it can be ANSI (with ISO-8859-2 charset), UTF-8, UCS-2 Big or Little Endian.
我有一个文本文件,它可以是 ANSI(带有 ISO-8859-2 字符集)、UTF-8、UCS-2 大端或小端。
Is there any way to detect the encoding of the file to read it properly?
有没有办法检测文件的编码以正确读取它?
Or is it possible to read a file without giving the encoding? (and it reads the file as it is)
或者是否可以在不提供编码的情况下读取文件?(并按原样读取文件)
(There are several programs that can detect and convert the encoding/format of text files.)
(有几个程序可以检测和转换文本文件的编码/格式。)
采纳答案by Jon Skeet
UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
UTF-8 和 UCS-2/UTF-16 可以通过文件开头的字节顺序标记轻松区分。如果它存在,那么可以很好地打赌该文件采用该编码 - 但这并不是绝对的确定性。您可能还会发现该文件采用其中一种编码,但没有字节顺序标记。
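To make the BOM check concrete, here is a minimal sketch (my own illustration, not part of the original answer; the method name and the decision to return null when no BOM is found are assumptions):

import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {
    // Returns a charset name guessed from the byte order mark, or null if no BOM was found.
    static String charsetFromBom(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] bom = new byte[3];
            int n = in.read(bom);
            if (n >= 3 && (bom[0] & 0xFF) == 0xEF && (bom[1] & 0xFF) == 0xBB && (bom[2] & 0xFF) == 0xBF) {
                return "UTF-8";
            }
            if (n >= 2 && (bom[0] & 0xFF) == 0xFE && (bom[1] & 0xFF) == 0xFF) {
                return "UTF-16BE";
            }
            if (n >= 2 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE) {
                return "UTF-16LE";
            }
            return null; // no BOM - the file may still be UTF-8/UTF-16 without one
        }
    }
}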
I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
我对 ISO-8859-2 了解不多,但如果几乎每个文件都是该编码的有效文本文件,我也不会感到惊讶。您能做的最好的事情就是启发式地检查它。事实上,谈论它的维基百科页面会表明只有字节 0x7f 是无效的。
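As a rough illustration of such a heuristic check, the sketch below rejects any content containing 0x7F (the only clearly invalid byte according to the answer); also rejecting other control bytes apart from tab, LF and CR is my own extra assumption, used here only to filter out binary-looking files:

// Very rough plausibility check for ISO-8859-2 text.
static boolean looksLikeIso88592(byte[] data) {
    for (byte b : data) {
        int v = b & 0xFF;
        if (v == 0x7F) {
            return false; // the one byte the answer calls invalid
        }
        if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
            return false; // my own extra filter for binary-looking content
        }
    }
    return true;
}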
There's no idea of reading a file "as it is" and yet getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
并不存在"按原样"读取文件却能得到文本这回事：文件是一个字节序列，因此必须应用某种字符编码才能将这些字节解码为字符。
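In Java that decoding step is always explicit (or falls back to the platform default). A minimal sketch, with a placeholder file name and charset:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Whatever encoding you detected or assumed for the file.
        Charset cs = Charset.forName("ISO-8859-2");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"), cs)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}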
回答by Jon
Yes, there's a number of methods to do character encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There's also cpdetector and a project by IBM called ICU4j. I'd take a look at the latter, as it seems to be more reliable than the other two. They work based on statistical analysis of the binary file, and ICU4j will also provide a confidence level for the character encoding it detects, so you can use this in the case above. It works pretty well.
是的,有很多方法可以进行字符编码检测,特别是在 Java 中。看看基于 Mozilla 算法的jchardet。还有cpdetector和 IBM 的一个名为ICU4j的项目。我会看看后者,因为它似乎比其他两个更可靠。它们基于二进制文件的统计分析工作,ICU4j 还将提供它检测到的字符编码的置信度,因此您可以在上述情况下使用它。它工作得很好。
回答by ssamuel68
You can use ICU4J (http://icu-project.org/apiref/icu4j/)
您可以使用 ICU4J ( http://icu-project.org/apiref/icu4j/)
Here is my code:
这是我的代码:
// Requires ICU4J on the classpath:
// import java.io.FileInputStream;
// import com.ibm.icu.text.CharsetDetector;
// import com.ibm.icu.text.CharsetMatch;

String charset = "ISO-8859-1"; // default charset, put whatever you want

// Create a byte array large enough to hold the content of the file;
// File.length() gives the size of the file in bytes.
byte[] fileContent = new byte[(int) file.length()];

// Read the content of the file into the byte array
// (for simplicity this assumes read() fills the buffer in one call).
FileInputStream fin = new FileInputStream(file.getPath());
fin.read(fileContent);
fin.close();

CharsetDetector detector = new CharsetDetector();
detector.setText(fileContent);
CharsetMatch cm = detector.detect();
if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    // Here you have the encoding name and the confidence.
    // In my case, if the confidence is > 50 I return the detected encoding,
    // else I return the default value.
    if (confidence > 50) {
        charset = cm.getName();
    }
}
Remember to add all the try/catch handling it needs.
记住补上所有需要的 try/catch。
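Once a charset has been chosen you can decode the bytes that were already read; a small sketch of my own, reusing the fileContent and charset variables from the snippet above:

// new String(byte[], String) throws UnsupportedEncodingException, so cover it
// with the same try/catch handling mentioned above.
String text = new String(fileContent, charset);
System.out.println(text);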
I hope this works for you.
我希望这对你有用。
回答by Glen
If your text file is a properly created Unicode text file then the Byte Order Mark (BOM) should tell you all the information you need. See here for more details about the BOM.
如果您的文本文件是正确创建的 Unicode 文本文件,那么字节顺序标记 (BOM) 应该会告诉您所需的所有信息。有关 BOM 的更多详细信息,请参见此处
If it's not then you'll have to use some encoding detection library.
如果不是,那么您将不得不使用一些编码检测库。

