How to identify binary and text files using Python?

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1446549/

Date: 2020-11-03 22:15:47  Source: igfitidea

How to identify binary and text files using Python?

Tags: python, text, binary, file-type

Asked by Thomas

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good idea in my case because it can't identify the MIME types of all files, and I have some strange ones here... I just need to know: binary or text. Simple? But I couldn't find a solution...

Thanks


Accepted answer by Thomas

Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a little piece to suit me.

It works fine.


from __future__ import division
import string 

def istext(filename):
    s=open(filename).read(512)
    text_characters = "".join(map(chr, range(32, 127)) + list("\n\r\t\b"))
    _null_trans = string.maketrans("", "")
    if not s:
        # Empty files are considered text
        return True
    if "\0" in s:
        # Files with null bytes are likely binary
        return False
    # Get the non-text characters (maps a character to itself then
    # use the 'remove' option to get rid of the text characters.)
    t = s.translate(_null_trans, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if float(len(t))/float(len(s)) > 0.30:
        return False
    return True
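
The recipe above relies on Python 2 behaviour (string.maketrans and the two-argument str.translate). A minimal sketch of the same heuristic for Python 3 follows, assuming the file is read in binary mode; the blocksize and threshold parameters and the TEXT_BYTES name are additions for illustration, not part of the original recipe.

```python
# Printable ASCII plus newline, carriage return, tab and backspace,
# mirroring the character set used in the recipe.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"


def istext(filename, blocksize=512, threshold=0.30):
    """Guess whether a file is text by sampling its first block."""
    with open(filename, "rb") as f:
        block = f.read(blocksize)
    if not block:
        # Empty files are considered text
        return True
    if b"\0" in block:
        # Files with null bytes are likely binary
        return False
    # Delete every text byte; what remains are the non-text bytes
    nontext = block.translate(None, TEXT_BYTES)
    # More than `threshold` non-text bytes means the file is treated as binary
    return len(nontext) / len(block) <= threshold
```

As in the recipe, bytes outside the ASCII range count as non-text, so a file that is mostly non-English UTF-8 may be reported as binary.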

Answer by Jon Skeet

It's inherently not simple. There's no way of knowing for sure, although you can take a reasonably good guess in most cases.

Things you might like to do:


  • Look for known magic numbers in binary signatures
  • Look for the Unicode byte-order mark at the start of the file
  • If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16
  • Otherwise, look for 0s in the file; a file with a 0 in it is unlikely to be a single-byte-encoding text file.
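
One way these checks could be combined is sketched below. It is only an illustration: the MAGIC_NUMBERS table is a tiny, hand-picked sample, the UTF-16 test is a strict version of the "regularly 00 xx" idea that insists on non-zero xx bytes, and all function and constant names are hypothetical.

```python
import codecs

# Byte-order marks recognised at the start of a file.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)

# A few well-known binary signatures (illustrative, far from complete).
MAGIC_NUMBERS = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
    b"PK\x03\x04",         # ZIP
    b"\x7fELF",            # ELF executable
)


def looks_like_utf16(block):
    """True if every other byte is zero and the rest are non-zero."""
    if len(block) < 4:
        return False
    even, odd = block[0::2], block[1::2]
    return (not any(even) and all(odd)) or (not any(odd) and all(even))


def guess_kind(block):
    """Return "text" or "binary" for a block of bytes; a guess, not a proof."""
    if not block:
        return "text"
    if block.startswith(MAGIC_NUMBERS):
        return "binary"
    if block.startswith(BOMS):
        return "text"
    if looks_like_utf16(block):
        return "text"
    if b"\0" in block:
        return "binary"
    return "text"


def guess_file_kind(path, blocksize=1024):
    """Apply guess_kind to the first `blocksize` bytes of a file."""
    with open(path, "rb") as f:
        return guess_kind(f.read(blocksize))
```

For example, guess_kind("hello".encode("utf-16-le")) is classified as text because of the alternating zero bytes, while b"abc\x00def" falls through to the null-byte check and is classified as binary.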

But it's all heuristic - it's quite possible to have a file which is a valid text file and a valid image file, for example. It would probably be nonsense as a text file, but legitimate in some encoding or other...

Answer by John Paulett

It might be possible to use libmagic, through the python-magic bindings, to guess the MIME type of the file. If you get back something in the "text/*" namespace, it is likely a text file, while anything else is likely a binary file.
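
A minimal sketch of that idea, assuming the third-party python-magic package (and the libmagic library it wraps) is installed. The helper names is_text_mime and is_text_file are made up for illustration; magic.from_file(path, mime=True) is the python-magic call that returns a MIME type string.

```python
def is_text_mime(mime_type):
    """Treat anything in the text/* namespace as text."""
    return mime_type.startswith("text/")


def is_text_file(path):
    """Guess whether `path` is text from the MIME type libmagic reports."""
    # Imported lazily so is_text_mime stays usable without the package.
    # Assumption: python-magic and libmagic are installed.
    import magic

    return is_text_mime(magic.from_file(path, mime=True))
```

Some human-readable formats may be reported outside text/* (for example as application/json), so the prefix test may need extending for a given set of files.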

Answer by Aoife

If your script is running on *nix, you could use something like this:


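import subprocess
import re

def is_text(fn):
    msg = subprocess.Popen(["file", fn], stdout=subprocess.PIPE).communicate()[0]
    return re.search('text', msg) != None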