Java: How to determine the correct charset encoding of a stream

Question

Asked by Joel

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an input stream or file?

I have tried using the following:

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

But on a file that I know to be encoded as ISO8859_1, the above code yields ASCII, which is not correct and does not allow me to render the content of the file correctly on the console.

Answer 1

Accepted answer by Luciano Fiandesio

I have used this library, similar to jchardet, for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

Answer 2

Answered by Kevin

Can you pick the appropriate charset in the constructor?

new InputStreamReader(new FileInputStream(in), "ISO8859_1");
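
A minimal sketch of this approach, assuming the encoding is known in advance. It passes a Charset constant rather than a name string, so no UnsupportedEncodingException can occur. The class and method names (ExplicitCharsetDemo, readAll) are made up for illustration and are not from the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    /** Reads the whole stream as text, decoding with the charset the caller names. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The bytes 0x63 0x61 0x66 0xE9 spell "cafe" with an accented e in ISO-8859-1.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```

For a file, the same call would take a FileInputStream instead of the in-memory stream used here.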

Answer 3

Answered by Eduard Wirch

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent. So every encoding "could" be the right one.

The getEncoding() method will return the encoding which was set up for the stream (read the JavaDoc). It will not guess the encoding for you.
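
A small sketch that illustrates the point, using only the JDK. The same bytes are wrapped in two readers with different charsets, and getEncoding() simply reports whatever each reader was constructed with. The class and method names (GetEncodingDemo, reportedEncoding) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetEncodingDemo {
    /** Returns what getEncoding() reports for a reader built with the given charset. */
    static String reportedEncoding(byte[] data, Charset charset) {
        InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(data), charset);
        // The name reflects the constructor argument, not the content of the bytes.
        return reader.getEncoding();
    }

    public static void main(String[] args) {
        byte[] data = {0x63, 0x61, 0x66, (byte) 0xE9};
        // Identical bytes, two different answers: nothing is being detected.
        System.out.println(reportedEncoding(data, StandardCharsets.UTF_8));
        System.out.println(reportedEncoding(data, StandardCharsets.ISO_8859_1));
    }
}
```

When no charset is passed, as in the question, the reader falls back to the platform default, which is why the reported name had nothing to do with the file.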

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a characteristic frequency for each character. In English the char e appears very often but ê will appear very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes. But a UTF-16 stream of mostly Latin text has a lot of them.
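
The 0x00 observation can be turned into a rough heuristic. The sketch below is only an illustration: the class name NulByteHeuristic and the 20% threshold are assumptions chosen for this example, and a real detector would combine many such signals.

```java
import java.nio.charset.StandardCharsets;

final class NulByteHeuristic {
    /** Fraction of bytes in the sample that are 0x00. */
    static double nulRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /** Guesses UTF-16 when many bytes are zero; the 0.2 threshold is arbitrary. */
    static boolean looksLikeUtf16(byte[] data) {
        return data.length >= 2 && nulRatio(data) > 0.2;
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

This only works for text dominated by ASCII-range characters; UTF-16 text in other scripts may contain few or no zero bytes.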

Or: you could ask the user. I've already seen applications which show you a snippet of the file in different encodings and ask you to select the "correct" one.

Answer 4

Answered by Fabian Steeg

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

Answer 5

Answered by Zach Scrivena

You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
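
A minimal sketch of that validation step, using only java.nio.charset. The class and method names (CharsetValidator, canDecode) are invented for this example. The decoder is configured to REPORT errors, because the default behaviour of readers and String constructors is to silently substitute a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetValidator {
    /** Returns false if the bytes are definitely not valid in the given charset. */
    static boolean canDecode(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true; // decodable, which does not prove the charset is the right one
        } catch (CharacterCodingException e) {
            return false; // malformed input or unmappable character
        }
    }

    public static void main(String[] args) {
        // 0xE9 is a valid ISO-8859-1 character but an incomplete sequence in UTF-8.
        byte[] data = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println("UTF-8:      " + canDecode(data, StandardCharsets.UTF_8));
        System.out.println("US-ASCII:   " + canDecode(data, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + canDecode(data, StandardCharsets.ISO_8859_1));
    }
}
```

Note that ISO-8859-1 assigns a character to every byte value, so it accepts any input. A "true" result for it carries no information, which is exactly the limitation described above.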

Answer 6

Answered by brianegge

For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files however one can generally detect this based on the first few bytes of the file.

UTF-8 and UTF-16 files may include a Byte Order Mark (BOM) at the very beginning of the file, although many UTF-8 files are written without one. The BOM is the character U+FEFF, a zero-width no-break space.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. On Unix or Cygwin, you can check the BOM with the file command. For example:

$ file sample2.sql 
sample2.sql: Unicode text, UTF-16, big-endian

For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
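
The linked code is not reproduced here. As a rough idea of what BOM sniffing involves, here is a minimal sketch that inspects the first bytes of a file. The class and method names (BomSniffer, detect) are hypothetical, and UTF-32 BOMs are deliberately ignored to keep the example short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
    /** Returns the charset indicated by a leading BOM, or empty if there is none. */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(detect(utf16be));
    }
}
```

An empty result only means that no BOM was found; the file may still be UTF-8 without a BOM, ISO-8859-1, or anything else. When a BOM is found, the caller still has to skip those bytes before decoding with the returned charset, otherwise U+FEFF shows up as the first character.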

Answer 7

Answered by falcon

I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively, but it seems to work.

Answer 8

Answered by Lorrat

The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scan the text.

Answer 9

Answered by user345883

Check this out: http://site.icu-project.org/ (icu4j). They have libraries for detecting the charset of an input stream. It could be as simple as this:

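import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.BufferedInputStream;
import java.nio.charset.UnsupportedCharsetException;

// The detector needs a stream that supports mark/reset, hence the BufferedInputStream
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect(); // best match, or null if nothing matched

if (cm != null) {
   reader = cm.getReader();
   charset = cm.getName();
} else {
   // UnsupportedCharsetException has no no-arg constructor; it takes a charset name
   throw new UnsupportedCharsetException("unknown");
}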

Answer 10

Answered by ssamuel68

If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.file.Files;

String charset = "ISO-8859-1"; // Default charset, put whatever you want

// Read the whole file into a byte array. A single FileInputStream.read(byte[])
// call is not guaranteed to fill the array, and the stream was never closed;
// Files.readAllBytes handles both.
byte[] data = Files.readAllBytes(file.toPath());

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence(); // 0 to 100
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    // Here you have the encoding name and the confidence.
    // In my case, if the confidence is > 50 I return the detected encoding, else I return the default value.
    if (confidence > 50) {
        charset = cm.getName();
    }
}

Remember to add try-catch blocks wherever they are needed.

I hope this works for you.
