Generating an MD5 checksum of a file

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/3431825/
Asked by Alexander
Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).
Accepted answer by quantumSoup
You can use hashlib.md5()
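For a file small enough to fit in memory, a minimal sketch (the filename is a placeholder):

import hashlib

with open("example.txt", "rb") as f:  # "example.txt" is a placeholder path
    print(hashlib.md5(f.read()).hexdigest())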
Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the md5 method:
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
Note: hash_md5.hexdigest() will return the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
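Since the question also asks about checking checksums, here is a usage sketch comparing the computed digest against a known value (expected_md5 is a hypothetical value):

expected_md5 = "9e107d9d372bb6826bd81d3542a419d6"  # hypothetical known checksum

# md5() is the function defined above
if md5("example.txt") == expected_md5:
    print("checksum OK")
else:
    print("checksum mismatch")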
Answered by Omnifarious
There is a way that's pretty memory inefficient.
single file:
import hashlib

def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())
list of files:
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
Recall though, that MD5 is known broken and should not be used for any purpose, since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:
[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
If you only want 128 bits worth of digest you can do .digest()[:16].
This will give you a list of tuples, each tuple containing the name of its file and its hash.
Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.
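As an illustration of how cheap that swap is, a hedged sketch of a helper that takes the algorithm name as a parameter (file_digest_hex is a hypothetical name; hashlib.new() accepts any algorithm name hashlib supports):

import hashlib

def file_digest_hex(fname, algo="sha256", blocksize=65536):
    # hashlib.new() lets the caller pick the algorithm by name
    h = hashlib.new(algo)
    with open(fname, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()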
Here is a way that is more complex, but memory efficient:
import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
 for fname in fnamelst]
And, again, since MD5 is broken and should not really ever be used anymore:
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
 for fname in fnamelst]
Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
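Concretely, that truncation could look like this (a sketch reusing the helpers above):

# truncate the sha256 digest to 128 bits (16 bytes)
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256())[:16])
 for fname in fnamelst]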
Answered by rsandwick3
I'm clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:
I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:
$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f /tmp/test.data.300k
real 0m0.043s
user 0m0.032s
sys 0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400
So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:
import hashlib

def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()
Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...
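For reference, a sketch of what the md5sum_mmap variant from the benchmark listing might look like (the exact implementation behind the numbers above is an assumption):

import hashlib
import mmap

def md5sum_mmap(filename):
    hash_md5 = hashlib.md5()
    with open(filename, "rb") as f:
        # map the whole file and feed it to the hasher as one bytes-like object
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hash_md5.update(mm)
    return hash_md5.hexdigest()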
EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:
import zlib

def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff
Note that this must start off with the empty byte string, as Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1 -- CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
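Since the CRC32 version "would have been basically the same", here is a hedged sketch of that analogue (crc32sum is a hypothetical name):

import zlib

def crc32sum(filename, blocksize=65536):
    checksum = 0  # unlike Adler-32, CRC-32 can start from 0
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.crc32(block, checksum)
    return checksum & 0xffffffff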
Answered by johnson
import hashlib
import pathlib

# note: this reads the whole file into memory at once
hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
Answered by Puchatek
I think relying on the invoke package and the md5sum binary is a bit more convenient than subprocess or the md5 package:
import invoke

def get_file_hash(path):
    return invoke.Context().run("md5sum {}".format(path), hide=True).stdout.split(" ")[0]
This of course assumes you have invoke and md5sum installed.
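A minimal usage sketch (the path is a placeholder and md5sum must be on PATH):

print(get_file_hash("/tmp/test.data"))  # prints the 32-character hex digest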
Answered by Boris
In Python 3.8+ you can do:
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
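Applying that substitution to the snippet above gives (same placeholder filename):

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.blake2b()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.hexdigest())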

