c# - How to determine whether a file is binary or text?

Question

Asked by Pablo Retyk

I need to determine, with about 80% accuracy, whether a file is binary or text. Is there any way to do this in C#, even a quick and dirty/ugly one?

Answer 1

Accepted answer by Ron Warholic

I would probably look for an abundance of control characters, which are typically present in a binary file but rare in a text file. Binary files tend to use 0 often enough that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization, you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text, or vice versa.
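
The control-character idea can be sketched roughly as follows. This is only an illustration: the class and method names, the 4096-byte sample size, and the 10% threshold are assumptions chosen for the example, not values from the answer. Note that UTF-16 text contains many 0 bytes and would be flagged as binary by this check.

```csharp
using System;
using System.IO;

public static class ControlCharHeuristic
{
    // Hypothetical helper: true if the sample contains a NUL byte, or if more
    // than maxControlRatio of its bytes are control characters other than
    // tab, line feed, form feed and carriage return.
    public static bool LooksBinary(byte[] sample, int length, double maxControlRatio = 0.10)
    {
        if (length == 0) return false; // treat an empty file as text

        int controlCount = 0;
        for (int i = 0; i < length; i++)
        {
            byte b = sample[i];
            if (b == 0) return true; // NUL is rare in 8-bit text encodings
            bool isWhitespace = b == 9 || b == 10 || b == 12 || b == 13;
            if ((b < 32 && !isWhitespace) || b == 127) controlCount++;
        }
        return (double)controlCount / length > maxControlRatio;
    }

    // Reads up to sampleSize bytes from the start of the file and tests them.
    public static bool LooksBinary(string path, int sampleSize = 4096)
    {
        var buffer = new byte[sampleSize];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return LooksBinary(buffer, read);
        }
    }
}
```

Calling ControlCharHeuristic.LooksBinary(path) on a candidate file gives a yes/no guess; raising or lowering the threshold trades false positives for false negatives.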

Answer 2

Answered by Jeff Yates

Quick and dirty is to use the file extension and look for common text extensions such as .txt. For this, you can use the Path.GetExtension call. Anything else would not really be classed as "quick", though it may well be dirty.
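
A minimal sketch of the extension check, assuming a hand-picked list of text extensions. The list and the class name are illustrative, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionHeuristic
{
    // Illustrative list only; extend it for your own problem domain.
    private static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".txt", ".csv", ".log", ".xml", ".json", ".cs", ".md"
        };

    public static bool HasTextExtension(string path)
    {
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the path has no extension.
        string extension = Path.GetExtension(path);
        return !string.IsNullOrEmpty(extension) && TextExtensions.Contains(extension);
    }
}
```

This never opens the file, so it is fast, but it trusts whoever named the file.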

Answer 3

Answered by zvolkov

There's a method based on Markov chains. Scan a few model files of both kinds and, for each byte value from 0 to 255, gather stats (basically the probability) of each subsequent value. This gives you a 256x256 (64K-entry) profile that you can compare your runtime files against (within a % threshold).

Supposedly, this is how browsers' Auto-Detect Encoding feature works.
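
One way the byte-transition profile could be built and used is sketched below. The answer does not say how to compare a file against a profile, so the average log-probability score here is an assumption swapped in for the "% threshold" comparison, and all names are hypothetical.

```csharp
using System;

public static class ByteTransitionProfile
{
    // Builds a 256x256 table where row a holds the observed probability that
    // byte value b follows byte value a in the training data.
    public static double[,] Build(byte[] data)
    {
        var counts = new int[256, 256];
        var rowTotals = new int[256];
        for (int i = 1; i < data.Length; i++)
        {
            counts[data[i - 1], data[i]]++;
            rowTotals[data[i - 1]]++;
        }

        var profile = new double[256, 256];
        for (int a = 0; a < 256; a++)
        {
            if (rowTotals[a] == 0) continue; // byte value never seen as a predecessor
            for (int b = 0; b < 256; b++)
            {
                profile[a, b] = (double)counts[a, b] / rowTotals[a];
            }
        }
        return profile;
    }

    // Average log-probability of the sample's transitions under the profile.
    // Unseen transitions get a small floor probability instead of zero.
    public static double Score(double[,] profile, byte[] sample, double floor = 1e-6)
    {
        if (sample.Length < 2) return 0.0;
        double total = 0.0;
        for (int i = 1; i < sample.Length; i++)
        {
            total += Math.Log(Math.Max(profile[sample[i - 1], sample[i]], floor));
        }
        return total / (sample.Length - 1);
    }
}
```

With one profile built from known text files and another from known binary files, a file would be classified as text when its score under the text profile is higher than its score under the binary profile.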

Answer 4

Answered by Chad Ruppert

A really, really, really dirty way would be to build a regex that accepts only standard text, punctuation, symbol, and whitespace characters, load a portion of the file into a text stream, then run it against the regex. Depending on what qualifies as a pure text file in your problem domain, no successful matches would indicate a binary file.

To account for Unicode, make sure to mark the encoding on your stream as such.

This is really suboptimal, but you said quick and dirty.
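
A rough sketch of the regex idea, assuming "standard text" means Unicode letters, marks, numbers, punctuation, symbols, spaces, tabs and line breaks. That character set, the 4096-character sample, and the names are assumptions for illustration. The extra check for U+FFFD is there because the decoder substitutes that character for bytes it cannot decode, and it would otherwise count as a symbol.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexTextCheck
{
    // Matches strings made up only of letters, marks, numbers, punctuation,
    // symbols, space separators, and tab/CR/LF.
    private static readonly Regex TextOnly =
        new Regex(@"\A[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}\t\r\n]*\z", RegexOptions.CultureInvariant);

    // True if the sample has no replacement characters and matches the pattern.
    public static bool LooksLikeText(string sample)
    {
        return sample.IndexOf('\uFFFD') < 0 && TextOnly.IsMatch(sample);
    }

    // Decodes up to sampleChars characters from the file with the given
    // encoding and tests them against the pattern.
    public static bool LooksLikeText(string path, Encoding encoding, int sampleChars = 4096)
    {
        var buffer = new char[sampleChars];
        using (var reader = new StreamReader(path, encoding))
        {
            int read = reader.ReadBlock(buffer, 0, buffer.Length);
            return LooksLikeText(new string(buffer, 0, read));
        }
    }
}
```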

Answer 5

Answered by foson

http://codesnipers.com/?q=node/68 describes how to detect UTF-16 vs. UTF-8 using a Byte Order Mark (which may appear in your file). It also suggests looping through some bytes to see if they conform to the UTF-8 multi-byte sequence pattern (below) to determine if your file is a text file.

0xxxxxxx ASCII < 0x80 (128)
110xxxxx 10xxxxxx 2-byte >= 0x80
1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x800
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
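
The byte patterns above translate fairly directly into a loop. The sketch below checks only the structure (lead byte followed by the right number of continuation bytes); as a simplification it does not reject overlong encodings or surrogate code points, and it treats a sequence cut off at the end of the sample as invalid. The names are hypothetical.

```csharp
using System;

public static class Utf8Check
{
    // Returns true if the first `length` bytes are structurally valid UTF-8:
    // every lead byte is followed by the expected number of 10xxxxxx bytes.
    public static bool IsStructurallyValidUtf8(byte[] data, int length)
    {
        int i = 0;
        while (i < length)
        {
            byte b = data[i];
            int continuation;
            if ((b & 0x80) == 0x00) continuation = 0;       // 0xxxxxxx
            else if ((b & 0xE0) == 0xC0) continuation = 1;  // 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuation = 2;  // 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuation = 3;  // 11110xxx
            else return false; // stray continuation byte or invalid lead byte

            // The sequence must fit inside the sample.
            if (i + continuation >= length) return false;

            for (int k = 1; k <= continuation; k++)
            {
                if ((data[i + k] & 0xC0) != 0x80) return false; // not 10xxxxxx
            }
            i += continuation + 1;
        }
        return true;
    }
}
```

Pure ASCII data, including binary data that happens to stay below 0x80, passes this check, so it is probably best combined with a control-character test.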

Answer 6

Answered by DigitalMindspring

If the real question here is "Can this file be read and written using StreamReader/StreamWriter without modification?", then the answer is here:

/// <summary>
/// Detect if a file is text and detect the encoding.
/// </summary>
/// <param name="encoding">
/// The detected encoding.
/// </param>
/// <param name="fileName">
/// The file name.
/// </param>
/// <param name="windowSize">
/// The number of characters to use for testing.
/// </param>
/// <returns>
/// true if the file is text.
/// </returns>
// Requires: using System.IO; using System.Text;
public static bool IsText(out Encoding encoding, string fileName, int windowSize)
{
    using (var fileStream = File.OpenRead(fileName))
    {
        var rawData = new byte[windowSize];
        var text = new char[windowSize];
        var isText = true;

        // Read raw bytes
        var rawLength = fileStream.Read(rawData, 0, rawData.Length);
        fileStream.Seek(0, SeekOrigin.Begin);

        // Detect the encoding from the byte order mark (from Rick Strahl's blog)
        // http://www.west-wind.com/weblog/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
        // Each check is guarded by rawLength so short files do not index past the data read.
        if (rawLength >= 3 && rawData[0] == 0xef && rawData[1] == 0xbb && rawData[2] == 0xbf)
        {
            encoding = Encoding.UTF8;
        }
        else if (rawLength >= 4 && rawData[0] == 0xff && rawData[1] == 0xfe && rawData[2] == 0 && rawData[3] == 0)
        {
            encoding = Encoding.UTF32; // UTF-32, little-endian (checked before UTF-16 LE, which shares FF FE)
        }
        else if (rawLength >= 2 && rawData[0] == 0xff && rawData[1] == 0xfe)
        {
            encoding = Encoding.Unicode; // UTF-16, little-endian
        }
        else if (rawLength >= 2 && rawData[0] == 0xfe && rawData[1] == 0xff)
        {
            encoding = Encoding.BigEndianUnicode; // UTF-16, big-endian
        }
        else if (rawLength >= 4 && rawData[0] == 0 && rawData[1] == 0 && rawData[2] == 0xfe && rawData[3] == 0xff)
        {
            encoding = new UTF32Encoding(true, true); // UTF-32, big-endian
        }
        else if (rawLength >= 3 && rawData[0] == 0x2b && rawData[1] == 0x2f && rawData[2] == 0x76)
        {
            encoding = Encoding.UTF7; // marked obsolete in .NET 5 and later
        }
        else
        {
            encoding = Encoding.Default;
        }

        // Read text and detect the encoding
        using (var streamReader = new StreamReader(fileStream))
        {
            streamReader.Read(text, 0, text.Length);
        }

        using (var memoryStream = new MemoryStream())
        {
            using (var streamWriter = new StreamWriter(memoryStream, encoding))
            {
                // Write the text to a buffer
                streamWriter.Write(text);
                streamWriter.Flush();

                // Get the buffer from the memory stream for comparison
                var memoryBuffer = memoryStream.GetBuffer();

                // The re-encoded text must cover every raw byte that was read
                isText = memoryStream.Length >= rawLength;

                // Compare only bytes read
                for (var i = 0; i < rawLength && isText; i++)
                {
                    isText = rawData[i] == memoryBuffer[i];
                }
            }
        }

        return isText;
    }
}

Answer 7

Answered by shytikov

How about another way: take the length of the byte array that holds the file's contents and compare it with the length of the string you get after converting that byte array to text.

If the lengths are the same, there are no "non-readable" symbols in the file, so it's text (I'm 80% sure).

Answer 8

Answered by bhavik shah

Sharing my solution in the hope that it helps others, just as these posts and forums have helped me.

Background

I have been researching and exploring a solution for the same problem, and I expected it to be simple or only slightly twisted.

However, most of the attempts, here as well as in other sources, provide convoluted solutions and dive into Unicode, the UTF series, BOMs, encodings, and byte orders. In the process, I also went off-road into ASCII tables and code pages.

Anyway, I have come up with a solution based on the idea of a stream reader and a custom control character check.

It is built taking into consideration various hints and tips provided on the forum and elsewhere, such as:

Check for lots of control characters, for example by looking for multiple consecutive null characters.
Check for UTF, Unicode, encodings, BOM, byte orders and similar aspects.

My goal is:

It should not rely on byte orders, encodings and other more involved, esoteric work.
It should be relatively easy to implement and easy to understand.
It should work on all types of files.

The solution presented works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, jpg. So far it gives the expected results.

How the solution works

I am relying on the StreamReader default constructor, which uses UTF8Encoding by default, to do what it does best with respect to determining the file's encoding-related characteristics.

I created my own version of the control character check because Char.IsControl does not seem useful. Its documentation says:

Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.

It considers LF and CR as control characters, among other things. That makes it not useful, since text files include at least CR and LF.

Solution

static void testBinaryFile(string folderPath)
{
    List<string> output = new List<string>();
    foreach (string filePath in getFiles(folderPath, true))
    {
        output.Add(isBinary(filePath).ToString() + "  ----  " + filePath);
    }
    Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}

public static List<string> getFiles(string path, bool recursive = false)
{
    return Directory.Exists(path) ?
        Directory.GetFiles(path, "*.*",
        recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList() :
        new List<string>();
}    

public static bool isBinary(string path)
{
    long length = new FileInfo(path).Length; // file size in bytes
    if (length == 0) return false;

    using (StreamReader stream = new StreamReader(path))
    {
        int ch;
        while ((ch = stream.Read()) != -1)
        {
            if (isControlChar(ch))
            {
                return true;
            }
        }
    }
    return false;
}

public static bool isControlChar(int ch)
{
    return (ch > Chars.NUL && ch < Chars.BS)
        || (ch > Chars.CR && ch < Chars.SUB);
}

public static class Chars
{
    public const char NUL = (char)0; // Null char
    public const char BS = (char)8; // Back Space
    public const char CR = (char)13; // Carriage Return
    public const char SUB = (char)26; // Substitute
}

If you try the above solution, let me know whether it works for you.

Answer 9

Answered by Steven de Salas

Great question! I was surprised myself that .NET does not provide an easy solution for this.

The following code worked for me to distinguish between images (png, jpg etc) and text files.

I just checked for consecutive nulls (0x00) in the first 512 bytes, as per suggestions by Ron Warholic and Adam Bruss:

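// Returns the file as text, or as a Base64 string if it looks binary.
// (The method wrapper and its name are illustrative; the original post showed only the body.)
static string ReadTextOrBase64(string path)
{
    if (!File.Exists(path))
    {
        return null;
    }

    // Is it binary? Check for consecutive nulls in the first 512 bytes.
    byte[] content = File.ReadAllBytes(path);
    for (int i = 1; i < 512 && i < content.Length; i++)
    {
        if (content[i] == 0x00 && content[i - 1] == 0x00)
        {
            return Convert.ToBase64String(content);
        }
    }

    // No? Return it as text.
    return File.ReadAllText(path);
}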

Obviously this is a quick-and-dirty approach; however, it can easily be expanded by breaking the file into 10 chunks of 512 bytes each and checking each one of them for consecutive nulls (personally, I would deduce it's a binary file if 2 or 3 of them match, since nulls are rare in text files).

That should provide a pretty good solution for what you are after.
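
The chunked variant might look something like the sketch below. Spreading the chunks evenly through the file, and requiring fewer hits when a small file yields fewer chunks, are assumptions added for the example; the class and parameter names are hypothetical.

```csharp
using System;
using System.IO;

public static class ChunkedNullCheck
{
    // True if the first `length` bytes of the chunk contain two NUL bytes in a row.
    public static bool HasConsecutiveNulls(byte[] chunk, int length)
    {
        for (int i = 1; i < length; i++)
        {
            if (chunk[i] == 0x00 && chunk[i - 1] == 0x00) return true;
        }
        return false;
    }

    // Samples up to chunkCount non-overlapping chunks spread through the file
    // and reports binary when at least minHits of them contain consecutive NULs.
    public static bool LooksBinary(string path, int chunkCount = 10, int chunkSize = 512, int minHits = 2)
    {
        using (var stream = File.OpenRead(path))
        {
            long fileLength = stream.Length;
            long step = Math.Max(chunkSize, fileLength / chunkCount);
            var buffer = new byte[chunkSize];
            int sampled = 0;
            int hits = 0;

            for (long offset = 0; offset < fileLength && sampled < chunkCount; offset += step)
            {
                stream.Seek(offset, SeekOrigin.Begin);
                int read = stream.Read(buffer, 0, buffer.Length);
                sampled++;
                if (HasConsecutiveNulls(buffer, read)) hits++;
            }

            // Small files yield fewer chunks, so never require more hits than chunks sampled.
            return sampled > 0 && hits >= Math.Min(minHits, sampled);
        }
    }
}
```

Unlike reading the whole file with File.ReadAllBytes, this touches at most chunkCount * chunkSize bytes, which matters for large files.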

Answer 10

Answered by Tyler Long

Another way is to detect the file's charset using UDE. If a charset is detected successfully, you can be sure that it's text; otherwise it's binary, because binary has no charset.

Of course, you can use a charset detection library other than UDE. If the charset detection library is good enough, this approach could achieve 100% correctness.
