Guessing the encoding of text represented as byte[] in Java

Question

Asked by knorv

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

No additional meta-data is available. The byte array is literally the only available input.
The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.

Answer 1

Accepted answer by knorv

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot settle on a charset.
    final String defaultEncoding = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole array, then signal the end of the input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null when nothing was detected.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = defaultEncoding;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet has both the better detection accuracy and the nicer API of the two libraries.

Answer 2

Answer by Alan Moore

Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

If there's a UTF-8 or UTF-16 BOM, return that encoding.
If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).
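
The four steps above can be sketched with nothing but the JDK. This is not the guessencoding library's own code, only a minimal illustration of the same idea; the class name `SimpleEncodingGuesser`, the method names, and the `fallback` parameter are made up for this example, and the UTF-8 pattern check is delegated to the JDK's strict decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of guessencoding.
final class SimpleEncodingGuesser {

    static Charset guess(byte[] b, Charset fallback) {
        // Step 1: look for a UTF-8 or UTF-16 byte order mark.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // Step 2: no byte with the high bit set means plain ASCII.
        boolean highBitSeen = false;
        for (byte x : b) {
            if (x < 0) {
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return StandardCharsets.US_ASCII;
        }
        // Step 3: high-bit bytes that form well-formed UTF-8 sequences.
        if (isWellFormedUtf8(b)) {
            return StandardCharsets.UTF_8;
        }
        // Step 4: give up and return the caller's fallback.
        return fallback;
    }

    static boolean isWellFormedUtf8(byte[] b) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Passing `Charset.defaultCharset()` as the fallback mirrors the "platform default" step; a fixed 8-bit charset such as windows-1252 may be a more predictable choice on servers.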

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.

Answer 3

Answer by Thomas Mueller

There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding. The guess is usually correct.

Answer 4

Answer by Rooke

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
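
To make the idea concrete, here is a toy sketch of frequency-based guessing. It is not what Internet Explorer or any of the libraries mentioned here actually does: the class `FrequencyGuess`, the `COMMON` letter set, and the scoring rule are all assumptions made for illustration. It decodes the bytes under each candidate charset and picks the decoding with the highest share of common English letters.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical toy scorer, for illustration only.
final class FrequencyGuess {

    // Assumed profile: space plus some of the most frequent English letters.
    private static final String COMMON = " etaoinshrdlu";

    // Fraction of characters in the text that belong to COMMON.
    static double score(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (COMMON.indexOf(c) >= 0) {
                hits++;
            }
        }
        return (double) hits / text.length();
    }

    // Decode under every candidate and keep the best-scoring charset.
    // On a tie, the earlier candidate in the list wins.
    static Charset guess(byte[] bytes, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1.0;
        for (Charset cs : candidates) {
            double s = score(new String(bytes, cs));
            if (s > bestScore) {
                bestScore = s;
                best = cs;
            }
        }
        return best;
    }
}
```

A profile this crude can separate UTF-16 from single-byte encodings, but it cannot tell UTF-8 from ISO-8859-1 on mostly-ASCII text; real detectors use per-language statistics over many more features.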

Answer 5

Answer by Chi

Check out jchardet

Answer 6

Answer by gomesla

There should be libraries already available for this.

A Google search turned up icu4j

or

http://jchardet.sourceforge.net/

Answer 7

Answer by ZZ Coder

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() method. If it returns true, treat the bytes as UTF-8; otherwise treat them as Latin-1.
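
The validUTF8() code lives in the linked answer and is not reproduced here. As a stand-in, the sketch below gets the same "UTF-8 if valid, otherwise Latin-1" behavior from the JDK's strict decoder; the class name `Utf8OrLatin1` and the method `decode` are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; substitutes a strict JDK decoder for validUTF8().
final class Utf8OrLatin1 {

    static String decode(byte[] bytes) {
        try {
            // REPORT makes malformed input throw instead of yielding U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte sequence is valid Latin-1.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Because Latin-1 accepts any byte sequence, the fallback never fails, but it will silently mis-decode text that is actually in some other 8-bit encoding.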
