Java: What is the most accurate encoding detector?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/3759356/
What is the most accurate encoding detector?
Asked by Winston Chen
After some investigation, I have discovered that there are a few encoding-detection projects in the Java world, for cases where getEncoding in InputStreamReader does not work:
However, I really do not know which is the best of them all. Can anyone with hands-on experience tell me which one is the best in Java?
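(For context on why getEncoding does not help here: it only reports the charset the reader was constructed with, never one inferred from the bytes. A minimal sketch, assuming a UTF-8 byte stream:)

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class GetEncodingDemo {
    public static void main(String[] args) throws Exception {
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // getEncoding() reports the charset the reader was constructed with
        // (here the platform default), not anything detected from the data.
        InputStreamReader byDefault =
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes));
        System.out.println(byDefault.getEncoding()); // e.g. "UTF8" on a UTF-8 JVM

        InputStreamReader forced =
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), "ISO-8859-1");
        System.out.println(forced.getEncoding()); // "ISO8859_1" -- whatever was passed in
    }
}
```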
Answered by fglez
I've personally used jchardet in our project (juniversalchardet wasn't available back then) just to check whether a stream was UTF-8 or not.
It was easier to integrate with our application than the others and yielded great results.
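For illustration, a minimal sketch of how jchardet is typically driven, following the nsDetector API from the SourceForge jchardet jar (the file path is a placeholder; this is not the answerer's actual code):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsPSMDetector;

public class JchardetSketch {
    public static void main(String[] args) throws Exception {
        nsDetector det = new nsDetector(nsPSMDetector.ALL);

        // The observer fires once the detector is confident about a charset.
        final String[] detected = new String[1];
        det.Init(charset -> detected[0] = charset);

        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] buf = new byte[1024];
            int len;
            boolean done = false;
            boolean isAscii = true;
            while (!done && (len = in.read(buf)) != -1) {
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                if (!isAscii) {
                    done = det.DoIt(buf, len, false);
                }
            }
        }
        det.DataEnd(); // flushes remaining state and may trigger the observer

        System.out.println(detected[0] != null ? detected[0] : "ASCII or undecided");
    }
}
```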
Answered by Winston Chen
I found an answer online:
http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
It says something valuable here:
The strength of a character encoding detector lies in whether or not its focus is on statistical analysis or HTML META and XML prolog discovery. If you are processing HTML files that have META, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.
So that's why I am using cpdetector now. I will update the post with the results.
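For reference, cpdetector is usually wired up as a proxy that chains several strategies, roughly like the sketch below (based on cpdetector's documented CodepageDetectorProxy API; package and class names may differ between cpdetector versions, and the file path is a placeholder):

```java
import java.io.File;
import java.nio.charset.Charset;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;

public class CpdetectorSketch {
    public static void main(String[] args) throws Exception {
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

        // Order matters: detectors are consulted until one reports a charset.
        detector.add(new ParsingDetector(false));   // HTML META / XML prolog discovery
        detector.add(JChardetFacade.getInstance()); // statistical analysis via jchardet
        detector.add(ASCIIDetector.getInstance());  // last resort for plain ASCII

        Charset charset = detector.detectCodepage(new File(args[0]).toURI().toURL());
        System.out.println(charset != null ? charset.name() : "unknown");
    }
}
```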
Answered by yishaiz
I've tested juniversalchardet and ICU4J on some CSV files, and the results were inconsistent: juniversalchardet had the better results:
- UTF-8: Both detected.
- Windows-1255: juniversalchardet detected it once there were enough Hebrew letters; ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (so the text was still OK).
- SHIFT_JIS (Japanese): juniversalchardet detected it; ICU4J thought it was ISO-8859-2.
- ISO-8859-1: detected by ICU4J, not supported by juniversalchardet.
So one should consider which encodings they will most likely have to deal with. In the end I chose ICU4J.
Notice that ICU4J is still maintained.
Also notice that you may want to use ICU4J first and, in case it returns null because it did not succeed, fall back to juniversalchardet. Or the other way around.
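A minimal sketch of that fallback idea, trying ICU4J's CharsetDetector first and juniversalchardet's UniversalDetector second (API usage as documented by each library; the file path is a placeholder, and reading the whole file into memory is just for brevity):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import org.mozilla.universalchardet.UniversalDetector;

public class FallbackDetectionSketch {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        // First attempt: ICU4J.
        CharsetDetector icu = new CharsetDetector();
        icu.setText(data);
        CharsetMatch match = icu.detect();
        String encoding = (match != null) ? match.getName() : null;

        // Fallback: juniversalchardet, if ICU4J found nothing.
        if (encoding == null) {
            UniversalDetector uni = new UniversalDetector(null);
            uni.handleData(data, 0, data.length);
            uni.dataEnd();
            encoding = uni.getDetectedCharset();
            uni.reset();
        }

        System.out.println(encoding != null ? encoding : "undetected");
    }
}
```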
AutoDetectReader of Apache Tika does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
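A sketch of that Tika approach (it assumes tika-core plus the tika-parsers module on the classpath, since the chained encoding detectors are discovered via the service loader; the file path is a placeholder):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.AutoDetectReader;

public class TikaDetectionSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
             AutoDetectReader reader = new AutoDetectReader(in)) {
            Charset charset = reader.getCharset();
            System.out.println(charset.name());
            // reader can now be used directly to read the decoded text
        }
    }
}
```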