Original question: http://stackoverflow.com/questions/620993/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Determining binary/text file type in Java?
Asked by yanchenko
Namely, how would you tell an archive (jar/rar/etc.) file from a textual (xml/txt, encoding-independent) one?
Accepted answer by Aric TenEyck
There's no guaranteed way, but here are a couple of possibilities:
1) Look for a header on the file. Unfortunately, headers are file-specific, so while you might be able to find out that it's a RAR file, you won't get the more generic answer of whether it's text or binary.
2) Count the number of character vs. non-character types. Text files will be mostly alphabetical characters while binary files - especially compressed ones like rar, zip, and such - will tend to have bytes more evenly represented.
3) Look for a regularly repeating pattern of newlines.
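As a rough illustration of option 1, the sketch below checks the first bytes of a file against two well-known signatures: the ZIP local file header (which JAR files share, since a JAR is a ZIP archive) and the RAR marker. The class name ArchiveSniffer and its methods are hypothetical, and only these two formats are covered; an empty ZIP archive, for example, starts with a different signature and would not be recognized.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: recognizes a few archive formats by their leading "magic" bytes. */
final class ArchiveSniffer {
    // ZIP local file header signature "PK" 0x03 0x04; JAR files use the same container.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // RAR marker: "Rar!" followed by 0x1A 0x07.
    private static final byte[] RAR = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07};

    static boolean startsWith(byte[] head, byte[] magic) {
        if (head.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }

    /** Returns "zip", "rar", or null when the header is not recognized. */
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, RAR)) return "rar";
        return null; // unknown header: says nothing about text vs. binary
    }

    /** Reads up to 8 leading bytes of the file and sniffs them. */
    static String sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // end of file
                n += r;
            }
        }
        return sniff(Arrays.copyOf(head, n));
    }
}
```

A null result only means the header is not one of the two known signatures, which matches the caveat above: header checks identify specific formats, not "text" in general.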
Answer by MarkusQ
Answer by Matthew
If the file consists only of the bytes 0x09 (tab), 0x0A (line feed), 0x0C (form feed), 0x0D (carriage return), or 0x20 through 0x7E, then it's probably ASCII text.
If the file contains any other ASCII control character, 0x00 through 0x1F excluding the four above, then it's probably binary data.
UTF-8 text follows a very specific pattern for any bytes with the high-order bit set, but fixed-length encodings like ISO-8859-1 do not. UTF-16 can frequently contain the null byte (0x00), but only in every other position.
You'd need a weaker heuristic for anything else.
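The rules above can be sketched roughly as follows. The class TextGuess and its Kind enum are hypothetical names. Two assumptions go beyond the answer: 0x7F (DEL) is treated like the other control characters, and "follows the UTF-8 pattern" is tested by running a strict UTF-8 decoder over the sample. UTF-16 is not handled, so its null bytes make the sample look binary here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical classifier for a byte sample, following the heuristic described above. */
final class TextGuess {
    enum Kind { ASCII_TEXT, UTF8_TEXT, PROBABLY_BINARY, UNKNOWN }

    /** Tab, line feed, form feed, carriage return, or printable ASCII. */
    static boolean isTextByte(int b) {
        return b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || (b >= 0x20 && b <= 0x7E);
    }

    static Kind classify(byte[] sample) {
        boolean sawHighBit = false;
        for (byte raw : sample) {
            int b = raw & 0xFF; // Java bytes are signed; mask to 0..255
            if (b >= 0x80) {
                sawHighBit = true;
            } else if (!isTextByte(b)) {
                // Other control character (0x7F included by assumption)
                return Kind.PROBABLY_BINARY;
            }
        }
        if (!sawHighBit) return Kind.ASCII_TEXT;
        try {
            // A strict decoder rejects byte sequences that break the UTF-8 pattern
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return Kind.UTF8_TEXT;
        } catch (CharacterCodingException e) {
            // High bytes without UTF-8 structure: ISO-8859-1 text or binary, undecidable here
            return Kind.UNKNOWN;
        }
    }
}
```

If the sample is a prefix of a larger file, it may end in the middle of a multi-byte UTF-8 sequence and be reported as UNKNOWN, so classifying the whole file (or trimming the sample at a character boundary) is safer.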
Answer by Daniel Hiller
Have a look at the JMimeMagic library.
jMimeMagic is a Java library for determining the MIME type of files or streams.
Answer by yanchenko
Just to let you know, I've chosen quite a different path. In my case, there are only two types of files, and the chances that any given file is binary are high. So:
- presume that file is binary, try doing what's supposed to be done (e.g. deserialize)
- catch exception
- treat file as textual
- if that fails, something is wrong with file itself
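The steps above might look something like the sketch below. The answer does not say which binary format is involved, so this assumes standard Java serialization for the "binary" attempt and strict UTF-8 decoding for the "textual" attempt; BinaryFirstLoader and its methods are hypothetical names. Deserializing data from untrusted sources is risky, so treat this only as an outline of the control flow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical loader: try the binary format first, fall back to text. */
final class BinaryFirstLoader {
    /** Returns the deserialized object, or the content as a String if it is not binary. */
    static Object load(byte[] content) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(content))) {
            return in.readObject();          // step 1: presume binary
        } catch (IOException | ClassNotFoundException notBinary) {
            return decodeText(content);      // steps 2 and 3: catch, then treat as text
        }
    }

    /** Strict UTF-8 decoding; failure means the file itself is bad (step 4). */
    static String decodeText(byte[] content) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("neither a serialized object nor UTF-8 text", e);
        }
    }

    /** Small helper for trying the sketch out: serializes a value to bytes. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

This ordering pays off when most files are binary, as in the answer's case: the common path succeeds on the first attempt and the exception is only raised for the rarer text files.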
Answer by Wilfred Springer
Run file -bi {filename}. If whatever it returns starts with 'text/', then it's non-binary; otherwise it's binary. ;-)
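From Java, the command above could be invoked along these lines. This is a minimal sketch that assumes a file executable on the PATH that accepts -bi and prints a MIME type such as "text/plain; charset=us-ascii"; the flag may be spelled differently on some platforms. FileCommand, isTextMime, and isText are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper around the external "file" utility. */
final class FileCommand {
    /** True if the utility's output line starts with "text/". */
    static boolean isTextMime(String fileOutput) {
        return fileOutput != null && fileOutput.trim().startsWith("text/");
    }

    /** Runs "file -bi <path>" and interprets the first output line. */
    static boolean isText(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("file", "-bi", path)
                .redirectErrorStream(true) // merge stderr so the process cannot block on it
                .start();
        String line;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            line = r.readLine();
        }
        p.waitFor();
        return isTextMime(line);
    }
}
```

Keeping the output parsing in isTextMime separate from the process call makes the rule easy to check without the external tool installed.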
Answer by Michael von Wenckstern
I used this code, and it works pretty well for English and German text:
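import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

private boolean isTextFile(String filePath) throws IOException {
    File f = new File(filePath);
    if (!f.exists())
        return false;
    byte[] data = new byte[1000];
    int size;
    try (InputStream in = new FileInputStream(f)) {
        // Sample up to the first 1000 bytes; read() returns -1 for an empty file
        size = Math.max(in.read(data), 0);
    }
    if (size == 0)
        return false; // nothing to measure (the ratio below would be NaN)
    // ISO-8859-1 maps every byte to exactly one char, so s.length() == size
    String s = new String(data, 0, size, StandardCharsets.ISO_8859_1);
    // NOTE: the accented letters in the original character class were damaged in
    // extraction; the Latin-1 letter range U+00C0 to U+00FF stands in for them
    // here (an assumption, not the author's exact list)
    String s2 = s.replaceAll(
            "[a-zA-Z0-9\\.\\*!\"§\\$%&/()=\\?@~'#:,;\\+><\\|\\[\\]\\{\\}\\^°\\\\ \\n\\r\\t_\\-`" +
            "\\u00C0-\\u00FF¢£¥±]", "");
    // will delete all text signs
    double d = (double) (s.length() - s2.length()) / (double) s.length();
    // percentage of text signs in the text
    return d > 0.95;
}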
Answer by Ondra Žižka
I made this one. It's a bit simpler, but for Latin-based languages it should work fine once the ratio is adjusted.
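import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Guess whether given file is binary. Just checks for anything under 0x09.
 */
public static boolean isBinaryFile(File f) throws IOException {
    byte[] data = new byte[1024];
    int size;
    try (InputStream in = new FileInputStream(f)) {
        // Sample up to the first 1024 bytes; read() returns -1 for an empty file
        size = Math.max(in.read(data), 0);
    }
    int ascii = 0;
    int other = 0;
    for (int i = 0; i < size; i++) {
        // Mask to 0..255: Java bytes are signed, so without the mask every
        // byte >= 0x80 would compare as "under 0x09" and flag the file as binary
        int b = data[i] & 0xFF;
        if (b < 0x09) return true;
        if (b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D) ascii++;
        else if (b >= 0x20 && b <= 0x7E) ascii++;
        else other++;
    }
    if (other == 0) return false;
    // Binary only if more than 95% of the sampled bytes are non-ASCII; adjust this ratio as needed
    return 100 * other / (ascii + other) > 95;
}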
Answer by rince
Using the Java 7 Files class: http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

boolean isBinaryFile(File f) throws IOException {
    String type = Files.probeContentType(f.toPath());
    if (type == null) {
        // type couldn't be determined, assume binary
        return true;
    }
    // any type that doesn't start with "text" is treated as binary
    return !type.startsWith("text");
}