Java Character Encoding Detection Algorithm
Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/774075/
Character Encoding Detection Algorithm
Asked by Jon
I'm looking for a way to detect character sets within documents. I've been reading the Mozilla character set detection implementation here:
I've also found a Java implementation of this called jCharDet:
Both of these are based on research carried out using a set of static data. What I'm wondering is whether anybody has used any other implementation successfully, and if so, which one? Did you roll your own approach, and if so, what algorithm did you use to detect the character set?
Any help would be appreciated. I'm not looking for a list of existing approaches via Google, nor am I looking for a link to the Joel Spolsky article - just to clarify : )
UPDATE: I did a bunch of research into this and ended up finding a framework called cpdetector that uses a pluggable approach to character detection, see:
This provides BOM, chardet (Mozilla approach) and ASCII detection plugins. It's also very easy to write your own. There's also another framework, which provides much better character detection than the Mozilla approach/jchardet etc...
It's quite easy to write your own plugin for cpdetector that uses this framework to provide a more accurate character encoding detection algorithm. It works better than the Mozilla approach.
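To make the pluggable idea concrete, here is a minimal sketch of a detector chain. It is not cpdetector's actual API: the `EncodingDetector` interface, the `DetectorChain` class and the `ASCII` plugin are hypothetical names made up for illustration. Each plugin either names a charset or declines, and the chain returns the first definite answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public final class DetectorChain {
    /** Hypothetical plugin contract: return a charset name, or empty if undecided. */
    public interface EncodingDetector {
        Optional<String> detect(byte[] data);
    }

    /** Example plugin: claims the input only when every byte is 7-bit ASCII. */
    public static final EncodingDetector ASCII = data -> {
        for (byte b : data) {
            if (b < 0) {
                return Optional.empty(); // high bit set, so not plain ASCII
            }
        }
        return Optional.of("US-ASCII");
    };

    private final List<EncodingDetector> plugins;

    public DetectorChain(List<EncodingDetector> plugins) {
        this.plugins = new ArrayList<>(plugins); // defensive copy
    }

    /** Asks each plugin in order and returns the first definite answer. */
    public Optional<String> detect(byte[] data) {
        for (EncodingDetector plugin : plugins) {
            Optional<String> result = plugin.detect(data);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

Ordering matters in a chain like this: cheap, unambiguous checks (such as a BOM or pure-ASCII test) would normally go first, with statistical detectors last.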
Accepted answer by Jared Oberhaus
Years ago we had character set detection for a mail application, and we rolled our own. The mail app was actually a WAP application, and the phone expected UTF-8. There were several steps:
Universal
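We could easily detect whether text was UTF-8, because every byte after the first in a multi-byte sequence carries a specific pattern in its top bits (10xxxxxx), and the top bits of the lead byte give the length of the sequence. Once you found that pattern repeated a certain number of times, you could be certain it was UTF-8.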
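The bit-pattern check can be sketched roughly as follows. This is not the code from the mail application: `Utf8Sniffer` and the `minSequences` threshold are illustrative assumptions, and the check is deliberately simplified (it does not reject every overlong form or encoded surrogate).

```java
public final class Utf8Sniffer {
    /**
     * Counts structurally well-formed multi-byte UTF-8 sequences.
     * Returns -1 as soon as a malformed or truncated sequence is seen.
     */
    public static int countMultiByteSequences(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {            // 0xxxxxxx: plain ASCII, tells us nothing
                i++;
                continue;
            }
            int trailing;
            if (b >= 0xC2 && b <= 0xDF) {        // 110xxxxx: lead of a 2-byte sequence
                trailing = 1;                    // (0xC0 and 0xC1 are always overlong)
            } else if (b >= 0xE0 && b <= 0xEF) { // 1110xxxx: lead of a 3-byte sequence
                trailing = 2;
            } else if (b >= 0xF0 && b <= 0xF4) { // 11110xxx: lead of a 4-byte sequence
                trailing = 3;
            } else {
                return -1;                       // stray continuation or invalid lead byte
            }
            if (i + trailing >= data.length) {
                return -1;                       // sequence runs past the end of the data
            }
            for (int k = 1; k <= trailing; k++) {
                if ((data[i + k] & 0xC0) != 0x80) {
                    return -1;                   // continuation bytes must be 10xxxxxx
                }
            }
            count++;
            i += trailing + 1;
        }
        return count;
    }

    /** Treats the data as UTF-8 once enough valid sequences were seen and none were malformed. */
    public static boolean looksLikeUtf8(byte[] data, int minSequences) {
        return countMultiByteSequences(data) >= minSequences;
    }
}
```

Pure ASCII input yields a count of 0, so with a threshold of 1 or more it is reported as "not proven UTF-8" rather than as a mismatch; callers can decide how to treat that case.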
If the file begins with a UTF-16 byte order mark, you can probably assume the rest of the text is that encoding. Otherwise, detecting UTF-16 isn't nearly as easy as UTF-8, unless you can detect the surrogate pairs pattern: but the use of surrogate pairs is rare, so that doesn't usually work. UTF-32 is similar, except there are no surrogate pairs to detect.
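A byte order mark check along these lines might look like the sketch below. `BomSniffer` is a hypothetical helper, not part of any library mentioned here; it returns charset names as strings so that it does not depend on which UTF-32 charsets a given Java runtime provides.

```java
import java.util.Optional;

public final class BomSniffer {
    /** Returns the encoding implied by a leading byte order mark, if there is one. */
    public static Optional<String> detect(byte[] data) {
        // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with
        // the UTF-16LE mark FF FE. (It could in theory be UTF-16LE text that
        // starts with U+0000, which this sketch assumes is not the case.)
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty(); // no BOM: fall through to other heuristics
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```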
Regional detection
Next we would assume the reader was in a certain region. For instance, if the user was seeing the UI localized in Japanese, we could then attempt detection of the three main Japanese encodings. ISO-2022-JP is again easy to detect with the escape sequences. If that fails, determining the difference between EUC-JP and Shift-JIS is not as straightforward. It's more likely that a user would receive Shift-JIS text, but there were characters in EUC-JP that didn't exist in Shift-JIS, and vice-versa, so sometimes you could get a good match.
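One way to approximate this regional step in Java is sketched below, under stated assumptions: `RegionalSniffer` is a hypothetical helper, the escape-sequence scan covers only the most common ISO-2022-JP designators, and strict trial decoding stands in for whatever matching the original application actually did.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public final class RegionalSniffer {
    /** True if data contains one of the common ISO-2022-JP escape sequences. */
    public static boolean hasIso2022JpEscape(byte[] data) {
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] != 0x1B) {
                continue; // not ESC
            }
            byte a = data[i + 1];
            byte b = data[i + 2];
            if (a == '$' && (b == '@' || b == 'B')) {
                return true; // ESC $ @ or ESC $ B: switch to JIS X 0208
            }
            if (a == '(' && (b == 'B' || b == 'J')) {
                return true; // ESC ( B or ESC ( J: switch to ASCII or JIS X 0201 Roman
            }
        }
        return false;
    }

    /** True if data decodes in the given charset with no malformed or unmappable input. */
    public static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If the runtime provides them, candidates such as `Charset.forName("Shift_JIS")` and `Charset.forName("EUC-JP")` can be passed to `decodesCleanly`. Text that decodes cleanly under only one candidate is a good match; when both succeed, the result is ambiguous and a prior such as "Shift-JIS is more likely" has to break the tie.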
The same procedure was used for Chinese encodings and other regions.
User's choice
If these didn't provide satisfactory results, the user had to choose an encoding manually.
Answer by McDowell
Not exactly what you asked for, but I noticed that the ICU project includes a CharsetDetector class.