How to identify binary and text files using Python?

Question

Asked by Thomas

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good fit in my case because it can't identify the MIME type of every file, and I have some strange ones here... I just need to know: binary or text. Simple? But I couldn't find a solution...

Thanks

Answer 1

Accepted answer by Thomas

Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a small piece to suit me.

It works fine.

from __future__ import division
import string 

def istext(filename):
    s=open(filename).read(512)
    text_characters = "".join(map(chr, range(32, 127)) + list("\n\r\t\b"))
    _null_trans = string.maketrans("", "")
    if not s:
        # Empty files are considered text
        return True
    if "\0" in s:
        # Files with null bytes are likely binary
        return False
    # Get the non-text characters (maps a character to itself then
    # use the 'remove' option to get rid of the text characters.)
    t = s.translate(_null_trans, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if float(len(t))/float(len(s)) > 0.30:
        return False
    return True
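
The recipe above relies on Python 2 behaviour (string.maketrans and the two-argument str.translate). A minimal sketch of the same heuristic for Python 3 might look like the following; the function name istext_py3 and the blocksize and threshold parameters are hypothetical names introduced here, not part of the original recipe.

```python
def istext_py3(filename, blocksize=512, threshold=0.30):
    """Guess whether a file is text by sampling its first bytes."""
    # Printable ASCII plus newline, carriage return, tab and backspace:
    # the same set the Python 2 recipe treats as "text"
    text_characters = bytes(range(32, 127)) + b"\n\r\t\b"
    # Binary mode means no decoding is attempted on arbitrary bytes
    with open(filename, "rb") as f:
        sample = f.read(blocksize)
    if not sample:
        # Empty files are considered text
        return True
    if b"\0" in sample:
        # Files with null bytes are likely binary
        return False
    # bytes.translate(None, delete) drops every byte listed in `delete`,
    # so only the non-text bytes remain
    non_text = sample.translate(None, text_characters)
    # More than `threshold` non-text bytes means "binary"
    return len(non_text) / len(sample) <= threshold
```

Like the original, this only inspects the first block of the file, so a file whose binary content starts later would still be reported as text.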

Answer 2

Answered by Jon Skeet

It's inherently not simple. There's no way of knowing for sure, although you can take a reasonably good guess in most cases.

Things you might like to do:

Look for known magic numbers in binary signatures
Look for the Unicode byte-order-mark at the start of the file
If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16
Otherwise, look for 0 bytes in the file; a file containing a 0 byte is unlikely to be a text file in a single-byte encoding.

But it's all heuristic - it's quite possible to have a file which is a valid text file and a valid image file, for example. It would probably be nonsense as a text file, but legitimate in some encoding or other...
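
The checks listed in this answer could be sketched roughly as follows. This is an illustration under assumptions, not a definitive detector: the name guess_kind, the returned labels, and the short MAGIC_NUMBERS list are all hypothetical choices made here, and the UTF-16 test is deliberately strict (every other byte must be zero).

```python
import codecs

# Byte-order marks. The 4-byte UTF-32 marks come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# A few well-known binary signatures (illustrative, far from complete)
MAGIC_NUMBERS = [
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
    b"PK\x03\x04",         # ZIP
    b"\x7fELF",            # ELF executable
]

def guess_kind(sample):
    """Return a best-guess label for bytes read from the start of a file."""
    # 1. Known magic numbers
    if any(sample.startswith(magic) for magic in MAGIC_NUMBERS):
        return "binary"
    # 2. Unicode byte-order mark
    for bom, encoding in BOMS:
        if sample.startswith(bom):
            return "text (" + encoding + ")"
    # 3. 00 xx 00 xx ... or xx 00 xx 00 ... suggests UTF-16 without a BOM
    if len(sample) >= 4:
        even, odd = sample[0::2], sample[1::2]
        if even.count(0) == len(even) and odd.count(0) == 0:
            return "text (utf-16-be?)"
        if odd.count(0) == len(odd) and even.count(0) == 0:
            return "text (utf-16-le?)"
    # 4. Any other zero byte makes a single-byte text encoding unlikely
    if b"\x00" in sample:
        return "binary"
    return "text"
```

A caller would typically pass the first few hundred bytes of a file opened in "rb" mode, and should treat the result as a guess for the reasons given above.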

Answer 3

Answered by John Paulett

It might be possible to use libmagic to guess the MIME type of the file, via python-magic. If you get back something in the "text/*" namespace, it is likely a text file, while anything else is likely a binary file.
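
A minimal sketch of that idea, assuming the third-party python-magic package that exposes magic.from_file(path, mime=True) (other packages also install a module named magic with a different interface). The helper names is_text_mime and is_text_file are hypothetical.

```python
def is_text_mime(mime_type):
    """True when a MIME type string falls in the text/* namespace."""
    return mime_type.lower().startswith("text/")

def is_text_file(path):
    """Classify a file through libmagic (requires python-magic)."""
    # Imported lazily so is_text_mime stays usable without the package
    import magic
    return is_text_mime(magic.from_file(path, mime=True))
```

Some textual formats may be reported outside text/* (for example as application/json), so the "anything else is binary" rule is only an approximation.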

Answer 4

Answered by Aoife

If your script is running on *nix, you could use something like this:

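import subprocess
import re

def is_text(fn):
    # Ask the `file` utility to describe the file and capture its output
    msg = subprocess.Popen(["file", fn], stdout=subprocess.PIPE).communicate()[0]
    # The output is bytes, so search with a bytes pattern
    return re.search(b'text', msg) is not None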
