Hashing a file in Python

Disclaimer: the content below comes from a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/22058048/


Hashing a file in Python

Tags: python, hash, md5, sha1, hashlib

Asked by user3358300

I want python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:


import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

Answered by Randall Hunt

TL;DR: read the file in buffered chunks so you don't use tons of memory.


We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of RAM for a 2 gigabyte file, so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!


import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break  # read() returns b'' once we hit EOF, so stop here
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64kb chunks as we go, using hashlib's handy dandy update() method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!


You can test this with:

您可以使用以下方法进行测试:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!


Also, all of this is outlined in the linked question on the right-hand side: Get MD5 hash of big files in Python




Addendum!


In general when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
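For instance, here is a small sketch of that idea (a sketch only; the filename is just a placeholder): the same chunked hashing as above, with snake_case names.

import hashlib

# Same chunked approach as above, just with snake_case names per PEP 8
input_file = "bigfile"  # placeholder path -- substitute your own file
md5_hash = hashlib.md5()
sha1_hash = hashlib.sha1()

with open(input_file, 'rb') as f:
    # iter() with a sentinel of b'' keeps yielding 64kb chunks until EOF
    for chunk in iter(lambda: f.read(65536), b''):
        md5_hash.update(chunk)
        sha1_hash.update(chunk)

print("MD5: {0}".format(md5_hash.hexdigest()))
print("SHA1: {0}".format(sha1_hash.hexdigest()))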


Answered by maxschlepzig

For the correct and efficient computation of the hash value of a file (in Python 3):


  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:


import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128 * 1024)     # reusable 128 KiB read buffer
    mv = memoryview(b)             # a view over b, so we can slice it without copying
    with open(filename, 'rb', buffering=0) as f:   # unbuffered binary IO
        # readinto() fills the buffer in place and returns the number of bytes read;
        # iter() stops once it returns 0, i.e. at EOF
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])       # hash only the bytes actually read
    return h.hexdigest()
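A quick way to try it out (this command-line wrapper is just an illustration, not part of the original answer):

if __name__ == '__main__':
    import sys
    # hash the file named on the command line, e.g. `python sha256sum.py bigfile`
    print(sha256sum(sys.argv[1]))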

Answered by phyyyl

I have programmed a module which is able to hash big files with different algorithms.


pip3 install py_essentials

Use the module like this:


from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

Answered by Ome Mishra

import hashlib

# Note: this hashes the text you type in, not the contents of a file
user = input("Enter ")
h = hashlib.md5(user.encode())   # md5() wants bytes, so encode the string first
h2 = h.hexdigest()

# write the hex digest out to a file...
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

# ...then read it back and print it
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)

Answered by Mitar

I would propose simply:


def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to complicate things too much. Python already buffers when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in the chunk size the hash function finds ideal, which makes it faster, or at least less CPU intensive, to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
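For example, a minimal sketch of that idea, assuming you do know a chunk size that suits your storage (the 1 MiB figure and the function name are just illustrative):

import hashlib

def get_digest_tuned(file_path, buffer_size=1024 * 1024):
    """Like get_digest() above, but with an explicit buffer size handed to open()."""
    h = hashlib.sha256()
    # 'buffering' sets the size of Python's own read buffer for this file object
    with open(file_path, 'rb', buffering=buffer_size) as file:
        while True:
            chunk = file.read(h.block_size)  # still feed the hash in its preferred block size
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()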
