C++: how to identify file content as ASCII or binary
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/277521/
How to identify the file content as ASCII or binary
Asked by Daniel Cassidy
How do you identify the file content as being in ASCII or binary using C++?
Answer by Daniel Cassidy
If a file contains only the decimal bytes 9–13 and 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
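The two byte-range checks above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the names ByteClass, is_ascii_text_byte, classify_bytes and classify_file are hypothetical, and the sketch assumes the file is small enough to read into memory in one go.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical labels for the outcomes of the first two checks.
enum class ByteClass { Ascii, EightBitText, Other };

// True for the "text" bytes named above: decimal 9-13 and 32-126.
inline bool is_ascii_text_byte(unsigned char b) {
    return (b >= 9 && b <= 13) || (b >= 32 && b <= 126);
}

// Classify a buffer that has already been read into memory.
inline ByteClass classify_bytes(const std::string& data) {
    bool saw_high = false;
    for (unsigned char b : data) {
        if (is_ascii_text_byte(b)) continue;
        if (b >= 128) { saw_high = true; continue; }
        return ByteClass::Other;  // other control byte (0-8, 14-31) or 127
    }
    return saw_high ? ByteClass::EightBitText : ByteClass::Ascii;
}

// Read the whole file in binary mode and classify it.
// Assumption: a file that cannot be opened reads as empty, which
// classifies as Ascii; real code would check the stream state first.
inline ByteClass classify_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return classify_bytes(data);
}
```

A result of Other here only means the file is not text in an 8-bit ASCII-based encoding; it may still be 16- or 32-bit text.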
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:
- If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
- If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
- If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
- If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
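A minimal sketch of the byte-order-mark checks listed above, assuming the first few bytes of the file have already been read into a std::string. The names Bom and detect_bom are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical labels for the tentative encodings listed above.
enum class Bom { None, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Inspect the first 2-4 bytes of a file for a byte-order mark.
inline Bom detect_bom(const std::string& head) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    const std::size_t n = head.size();
    // Test the four-byte marks first, so that FF FE 00 00 is reported as
    // UTF-32 LE and never reaches the UTF-16 LE test below.
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return Bom::Utf32LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```

The result is only tentative: a binary file can begin with the same bytes by coincidence.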
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
- If the file contains only big-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 BE.
- If the file contains only little-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 LE.
- If the file contains only big-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 BE.
- If the file contains only little-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 LE.
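The four word-based checks above differ only in word width and byte order, so one hypothetical helper can cover all of them. This is a sketch under assumptions: the names is_text_unit and all_text_units are invented here, and treating an empty buffer or a length that is not a multiple of the word width as "not text" is a choice the answer does not make.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Word test from the list above: 9-13, 32-126, or 128 and above.
inline bool is_text_unit(std::uint32_t u) {
    return (u >= 9 && u <= 13) || (u >= 32 && u <= 126) || u >= 128;
}

// width is 2 (UTF-16) or 4 (UTF-32); big_endian selects the byte order.
// Assumption: empty data or a trailing partial word counts as "not text".
inline bool all_text_units(const std::string& data, std::size_t width,
                           bool big_endian) {
    if (width == 0 || data.empty() || data.size() % width != 0) return false;
    for (std::size_t i = 0; i < data.size(); i += width) {
        std::uint32_t u = 0;
        for (std::size_t k = 0; k < width; ++k) {
            // Visit the bytes of this word from most to least significant.
            const std::size_t idx = big_endian ? i + k : i + width - 1 - k;
            u = (u << 8) | static_cast<unsigned char>(data[idx]);
        }
        if (!is_text_unit(u)) return false;
    }
    return true;
}
```

Note that plain English text stored as UTF-16 LE also passes the UTF-16 BE test, because a word such as 0x0048 reads as 0x4800 in the other byte order, which is above 128. A byte-order mark or a further heuristic is needed to tell the two apart.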
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
Answer by Johannes Schaub - litb
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127. One of many ways to do it:
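#include <cstdio>   // EOF
#include <fstream>

int main() {
    // Open in binary mode so that no newline translation alters the bytes.
    std::ifstream a("file.txt", std::ios::binary);
    int c;
    // Stop at end of file or at the first byte above 127.
    while ((c = a.get()) != EOF && c <= 127)
        ;
    if (c == EOF) {
        /* file is all ASCII (or could not be opened) */
    }
}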
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ascii". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you would need a different approach.
Answer by bart
My text editor decides based on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
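A rough sketch of that null-byte heuristic, not any particular editor's code. The names has_null_byte and file_looks_binary are hypothetical, and inspecting only the first 8192 bytes is an assumption made here to keep the check cheap.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// True if the buffer contains at least one 0x00 byte.
inline bool has_null_byte(const std::string& data) {
    return data.find('\0') != std::string::npos;
}

// Read up to max_bytes from the start of the file (binary mode) and
// report whether a null byte was seen. The 8192 default is an assumption.
inline bool file_looks_binary(const std::string& path,
                              std::size_t max_bytes = 8192) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(max_bytes, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<std::size_t>(in.gcount()));  // keep only bytes read
    return has_null_byte(buf);
}
```

This heuristic labels UTF-16 and UTF-32 text as binary, since those encodings store many zero bytes for ASCII-range characters.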
Answer by philant
Have a look at how the file command works; it has three strategies to determine the type of a file:
- filesystem tests
- magic number tests
- and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
Answer by Tomalak
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you'll see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
Answer by David Arno
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However, if san wanted to know how to determine whether the file contains text or not, then the issue becomes far more complex. ASCII is just one, increasingly unpopular, way of representing text. Unicode encodings (UTF-16, UTF-32 and UTF-8) have grown in popularity. In theory, they can be easily tested for by checking if the first two bytes are the Unicode byte order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes screw up many file formats on Linux systems, they cannot be guaranteed to be there. Further, a binary file might start with 0xFEFF.
Looking for 0x00's (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it would be possible to analyse the bytes and statistically determine if it contains text or not. For example, the most common letter in English is E, followed by T. So if the file contains lots more E's and T's than Z's and X's, it's likely text. Of course, it would be necessary to test this as ASCII and as the various Unicode encodings to make sure.
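As a very rough illustration of the letter-frequency idea, the sketch below counts E/T against Z/X in a buffer interpreted as ASCII. The name looks_like_english and the factor of 4 used for "lots more" are assumptions made for this example; a real detector would use a fuller frequency table and would repeat the test for each candidate encoding.

```cpp
#include <cstddef>
#include <string>

// Compare counts of common English letters (E, T) with rare ones (Z, X),
// treating the bytes as ASCII. The factor of 4 is an arbitrary assumption.
inline bool looks_like_english(const std::string& data) {
    std::size_t common = 0, rare = 0;
    for (unsigned char b : data) {
        // Setting bit 0x20 folds ASCII upper-case letters to lower case.
        const unsigned char lower = static_cast<unsigned char>(b | 0x20);
        if (lower == 0x65 || lower == 0x74) {          // 'e' or 't' in ASCII
            ++common;
        } else if (lower == 0x7A || lower == 0x78) {   // 'z' or 'x' in ASCII
            ++rare;
        }
    }
    return common > 0 && common > 4 * rare;
}
```

The letters are written as numeric ASCII codes rather than character literals so that the test does not depend on the compiler's execution character set.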
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
Answer by Shane Powell
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the type of ASCII file. It's not perfect, but it's interesting to see how Microsoft handles it.
Answer by schnaader
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A or 0x0D,0x0A) to detect text files.
Answer by MSalters
To check, you must open the file as binary; you can't open it as text. ASCII is effectively a subset of binary. After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters. You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!). If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
Answer by cweiske
GitHub's linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.