Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3678874/

Date: 2020-10-30 02:58:57  Source: igfitidea

Get file's encoding in Java

Tags: java, encoding, file-upload, utf-8, csv

Asked by Simon Guo

Possible Duplicate:
Java : How to determine the correct charset encoding of a stream

A user will upload a CSV file to the server, and the server needs to check whether the CSV file is encoded as UTF-8. Depending on the result, it needs to inform the user that (s)he uploaded a file with the wrong encoding. The problem is how to detect whether the uploaded file is UTF-8 encoded. The back end is written in Java. Does anyone have a suggestion?

Accepted answer by Jerry Coffin

At least in the general case, there's no way to be certain what encoding is used for a file -- the best you can do is a reasonable guess based on heuristics. You can eliminate some possibilities, but at best you're narrowing down the candidates without confirming any one of them. For example, most of the ISO 8859 variants allow any byte value (or pattern of byte values), so almost any content could be encoded with almost any ISO 8859 variant (and I'm only using "almost" out of caution, not out of any certainty that you could eliminate any of the possibilities).

You can, however, make some reasonable guesses. For example, if a file starts with the three bytes of a UTF-8 encoded BOM (EF BB BF), it's probably safe to assume it's really UTF-8. Likewise, if you see sequences like 110xxxxx 10xxxxxx, it's a pretty fair guess that what you're seeing is encoded with UTF-8. You can eliminate the possibility that something is (correctly) UTF-8 encoded if you ever see a sequence like 110xxxxx 110xxxxx. (110xxxxx is the lead byte of a sequence, which must be followed by a non-lead byte, not another lead byte, in properly encoded UTF-8.)
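
As a rough illustration of these heuristics, here is a minimal Java sketch. The class name `Utf8Check` and its method names are made up for this example. It checks for the BOM bytes and uses a strict `CharsetDecoder` to reject byte sequences that are not well-formed UTF-8. Keep in mind that passing the check only means UTF-8 has not been ruled out: plain ASCII data, for instance, passes it and is equally valid in the ISO 8859 variants.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class Utf8Check {

    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /**
     * Returns true if the bytes are well-formed UTF-8.
     * A false result rules UTF-8 out; a true result does not prove it.
     */
    static boolean isWellFormedUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "na\u00EFve,caf\u00E9".getBytes(StandardCharsets.UTF_8);
        // A lead byte (110xxxxx) followed by another lead byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0xC3};
        System.out.println("valid sample:   " + isWellFormedUtf8(valid));
        System.out.println("invalid sample: " + isWellFormedUtf8(invalid));
    }
}
```

For an uploaded CSV, one option would be to read the file into a byte array (or a reasonably sized prefix of it, taking care not to cut a multi-byte sequence in half) and reject the upload when `isWellFormedUtf8` returns false.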

Answered by yulkes

You can try and guess the encoding using a 3rd party library, for example: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

Answered by Carlos

Well, you can't. You could show kind of a "preview" (or should I say review?) with some sample data from the file so the user can check if it looks okay. Perhaps with the possibility of selecting different encoding options to help determine the correct one.
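
One way such a preview might be produced is sketched below. The class `EncodingPreview`, the method `previews`, and the choice of candidate charsets are all assumptions made for illustration. The idea is simply to decode the first bytes of the upload with each candidate and show the results side by side.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class EncodingPreview {

    /**
     * Decodes the first maxBytes of the data with each candidate charset.
     * Cutting at maxBytes may split a multi-byte character, so the last
     * character of a preview can be wrong.
     */
    static Map<String, String> previews(byte[] data, List<Charset> candidates, int maxBytes) {
        int length = Math.min(data.length, maxBytes);
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset charset : candidates) {
            // new String(...) substitutes U+FFFD for malformed input instead of throwing.
            result.put(charset.name(), new String(data, 0, length, charset));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample CSV content, deliberately encoded as ISO-8859-1.
        byte[] sample = "name,city\nZo\u00EB,K\u00F6ln\n".getBytes(StandardCharsets.ISO_8859_1);
        List<Charset> candidates = Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        previews(sample, candidates, 256).forEach(
                (name, text) -> System.out.println("--- " + name + " ---\n" + text));
    }
}
```

The user would then pick the preview that looks right, and the server would decode the whole file with the chosen charset.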
