Notice: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1677497/

Time: 2020-10-29 17:34:13  Source: igfitidea

Guessing the encoding of text represented as byte[] in Java

Tags: java, encoding, utf-8, character-encoding

Asked by knorv

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.

Accepted answer by knorv

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

import org.mozilla.universalchardet.UniversalDetector;

public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot make a guess.
    final String DEFAULT_ENCODING = "UTF-8";
    UniversalDetector detector = new UniversalDetector(null);
    // Feed the whole array, then signal end of input so the detector finalizes its guess.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null when no encoding was detected.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding != null ? encoding : DEFAULT_ENCODING;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

Answer by Alan Moore

Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

  • If there's a UTF-8 or UTF-16 BOM, return that encoding.
  • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
  • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
  • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
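
The four steps above can be sketched in plain Java. This is a minimal illustration of the described logic, not the guessencoding library's actual code: the class name SimpleEncodingGuesser and the fallback parameter are assumptions made for the example, and the UTF-8 check is structural only (it does not reject overlong forms or encoded surrogates).

```java
final class SimpleEncodingGuesser {

    /**
     * Hypothetical helper: returns a charset name guessed from the bytes.
     * 'fallback' stands in for the "platform default" of the last step.
     */
    static String guess(byte[] b, String fallback) {
        // Step 1: a byte order mark decides immediately.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }

        // Steps 2 and 3: scan once, tracking whether any high-bit byte appears
        // and whether every such byte is part of a well-formed UTF-8 sequence.
        boolean sawHighBit = false;
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            if (c < 0x80) {          // plain ASCII byte
                i++;
                continue;
            }
            sawHighBit = true;
            int extra;               // number of continuation bytes expected
            if (c >= 0xC2 && c <= 0xDF) {
                extra = 1;
            } else if (c >= 0xE0 && c <= 0xEF) {
                extra = 2;
            } else if (c >= 0xF0 && c <= 0xF4) {
                extra = 3;
            } else {
                return fallback;     // not a legal UTF-8 lead byte
            }
            if (i + extra >= b.length) {
                return fallback;     // sequence is cut off at the end of the array
            }
            for (int k = 1; k <= extra; k++) {
                if ((b[i + k] & 0xC0) != 0x80) {
                    return fallback; // continuation byte must look like 10xxxxxx
                }
            }
            i += extra + 1;
        }

        // Step 2 result if nothing had the high bit set, otherwise step 3.
        return sawHighBit ? "UTF-8" : "US-ASCII";
    }
}
```

To mirror the last step literally, pass Charset.defaultCharset().name() as the fallback. Note that on Java 18 and later the default charset is UTF-8 regardless of locale, so an explicit value such as "windows-1252" may be the more useful choice there.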

Answer by Thomas Mueller

There is also Apache Tika - a content analysis toolkit. It can guess the MIME type, and it can guess the encoding. Usually the guess is correct with a very high probability.

Answer by Rooke

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
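
To make the idea concrete, here is a deliberately crude sketch of the "decode with every candidate, keep the best-scoring parse" approach. Everything in it is an assumption made for illustration: the class name FrequencyGuesser, the four candidate charsets, and the English-only scoring rule (reward the most common letters and the space, penalize replacement and control characters). It is not how Internet Explorer or jchardet actually work, and a real detector would use proper per-language frequency tables.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class FrequencyGuesser {

    // Hypothetical English-only profile: the most frequent letters plus the space.
    private static final String COMMON = "etaoinshrdlu ";

    // Candidates are tried in this order; on a tie the earlier one wins.
    private static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_16LE,
            StandardCharsets.UTF_16BE);

    /** Average per-character score: +1 for a common character, -1 for obvious garbage. */
    static double score(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int points = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (COMMON.indexOf(Character.toLowerCase(c)) >= 0) {
                points++;
            } else if (c == '\uFFFD' || (Character.isISOControl(c) && !Character.isWhitespace(c))) {
                // U+FFFD marks bytes the decoder could not map; stray control
                // characters (such as NUL) rarely occur in real text.
                points--;
            }
        }
        return (double) points / text.length();
    }

    /** Decodes the bytes with every candidate and returns the name of the best-scoring charset. */
    static String guess(byte[] bytes) {
        Charset best = CANDIDATES.get(0);
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Charset candidate : CANDIDATES) {
            // new String(...) replaces malformed input with U+FFFD instead of throwing.
            double s = score(new String(bytes, candidate));
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best.name();
    }
}
```

Because pure ASCII input scores identically under UTF-8 and ISO-8859-1, the candidate order acts as the tie-breaker. The sketch only separates encodings whose wrong parses look visibly unlike English text.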

Answer by Chi

Check out jchardet

Answer by gomesla

There should already be libraries available for this.

A Google search turned up icu4j

or

http://jchardet.sourceforge.net/

Answer by ZZ Coder

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() methods from that answer. If the check returns true, treat the bytes as UTF-8, otherwise as Latin-1.
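
The validUTF8() code itself is not reproduced on this page, so the following sketch substitutes a strict java.nio CharsetDecoder as a stand-in for it. The class name Utf8OrLatin1 and both method names are hypothetical; only the decision rule (valid UTF-8 means UTF-8, anything else means Latin-1) comes from the answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8OrLatin1 {

    /** Stand-in for validUTF8(): true if the bytes decode as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently inserting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as Latin-1 (ISO-8859-1). */
    static String decode(byte[] bytes) {
        return new String(bytes,
                isValidUtf8(bytes) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
    }
}
```

Falling back to Latin-1 never fails, because every byte value maps to a character in ISO-8859-1. That is also why the check has to run in this order: the Latin-1 decode alone can never signal that the guess was wrong.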
