Python: the fastest way to create checksums for large files in Python

Question

Asked by pixelblender

I need to transfer large files across the network and create checksums for them on an hourly basis, so the speed of checksum generation is critical for me.

Somehow I can't make zlib.crc32 and zlib.adler32 work with files larger than 4 GB on a Windows XP Pro 64-bit machine. I suspect I've hit a 32-bit limitation here? Using hashlib.md5 I can get a result, but the problem is the speed: it takes roughly 5 minutes to generate an MD5 for a 4.8 GB file. Task Manager shows that the process is using only one core.

My questions are:

Is there a way to make CRC work on large files? I would prefer CRC over MD5.
If not, is there a way to speed up md5.hexdigest()/md5.digest(), or for that matter any hashlib hexdigest/digest? Maybe by splitting the work across multiple threads or processes? How would I do that?

PS: I'm working on something similar to an "asset management" system, kind of like svn, but the assets consist of large compressed image files. The files change in small increments. The hashing/checksum is needed for change detection and error detection.

Answer 1

Answered by mjv

It's an algorithm selection problem, rather than a library/language selection problem!

There appear to be two main points to consider:

How much would the disk I/O affect the overall performance?
What is the expected reliability of the error detection feature?

Apparently, the answer to the second question is something like "some false negatives allowed", since the reliability of any 32-bit hash, relative to a 4 GB message, even in a moderately noisy channel, is not going to be virtually absolute.

Assuming that I/O can be improved through multithreading, we may choose a hash that doesn't require a sequential scan of the complete message. Instead we can maybe work the file in parallel, hashing individual sections and either combining the hash values or appending them, to form a longer, more reliable error detection device.

The next step could be to formalize this handling of files as ordered sections, and to transmit them as such (to be re-glued together at the recipient's end). This approach, along with additional information about the way the files are produced (for example, they may be modified exclusively by appending, like log files), may even make it possible to limit the amount of hash calculation required. The added complexity of this approach needs to be weighed against the desire to have zippy-fast CRC calculation.

Side note: Adler32 is not limited to message sizes below a particular threshold. It may just be a limit of the zlib API. (BTW, the reference I found about zlib.adler32 used a single buffer, and well... this approach is to be avoided in the context of our huge messages, in favor of streamed processing: read a little from the file, calculate, repeat.)
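
The "read a little, calculate, repeat" approach can be sketched as follows. This is a minimal sketch, not the asker's code: the function name `stream_checksums` and the 1 MB chunk size are assumptions for illustration. It relies on the optional second argument of `zlib.crc32` and `zlib.adler32`, which carries the running checksum from one chunk to the next, so the whole file never has to sit in memory.

```python
import zlib


def stream_checksums(path, chunk_size=1024 * 1024):
    """Return (crc32, adler32) for a file, reading one chunk at a time.

    stream_checksums and chunk_size are illustrative names, not part of zlib.
    """
    crc = 0    # documented starting value for zlib.crc32
    adler = 1  # documented starting value for zlib.adler32
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # The second argument continues the running checksum.
            crc = zlib.crc32(chunk, crc)
            adler = zlib.adler32(chunk, adler)
    # Mask to unsigned 32 bits so results match across Python versions.
    return crc & 0xFFFFFFFF, adler & 0xFFFFFFFF
```

Because only one chunk is held at a time, the file size is limited by the file system rather than by the size of a single in-memory buffer.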

Answer 2

Answered by Stephen C. Steel

First, there is nothing inherent in any of the CRC algorithms that would prevent them working on an arbitrary length of data (however, a particular implementation might well impose a limit).

However, in a file syncing application, that probably doesn't matter, as you may not want to hash the entire file once it gets large, only chunks of it. If you hash the entire file, and the hashes at each end differ, you have to copy the entire file. If you hash fixed-size chunks, then you only have to copy the chunks whose hash has changed. If most of the changes to the files are localized (e.g. a database), then this will likely require much less copying (and it's easier to spread per-chunk calculations across multiple cores).
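
A minimal sketch of the fixed-size chunk idea, assuming both ends agree on the chunk size. The names `chunk_digests` and `changed_chunks` and the 4 MB default are hypothetical, chosen only for illustration; MD5 is used here because it is the hash discussed in the question.

```python
import hashlib


def chunk_digests(path, chunk_size=4 * 1024 * 1024):
    """Return one MD5 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).hexdigest())
    return digests


def changed_chunks(old, new):
    """Return indices of chunks that differ or exist on only one side."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]
```

Only the chunks listed by `changed_chunks` would need to be copied. Note that fixed offsets only help when edits happen in place: inserting bytes near the start of a file shifts every later chunk, so all of them would appear changed.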

As for the hash algorithm itself, the basic tradeoff is speed vs. lack of collisions (two different data chunks yielding the same hash). CRC-32 is fast, but with only 2^32 unique values, collisions may be seen. MD5 is much slower, but has 2^128 unique values, so collisions will almost never be seen (but are still theoretically possible). The larger hashes (SHA1, SHA256, ...) have even more unique values, but are slower still. I doubt you need them: you're worried about accidental collisions, unlike digital signature applications, where you're worried about deliberately (maliciously) engineered collisions.
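
To put rough numbers on "collisions may be seen", here is a small sketch using the standard birthday approximation. It assumes hash values are uniformly distributed, which is only an idealization for CRC-32; the function name `any_collision_probability` is made up for this example. For a single old-versus-new chunk comparison, the chance of missing a change is simply about 1 in 2^bits; the formula below covers the different case of any two among n distinct chunks sharing a hash.

```python
import math


def any_collision_probability(n_items, bits):
    """Approximate chance that at least two of n_items distinct inputs
    share a hash value, for an idealized uniform hash of the given width.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


if __name__ == "__main__":
    for bits in (32, 128):
        p = any_collision_probability(1000000, bits)
        print("%d-bit hash, one million chunks: p = %.3g" % (bits, p))
```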

It sounds like you're trying to do something very similar to what the rsync utility does. Can you just use rsync?

Answer 3

Answered by Calyth

You might be hitting a size limit for files in XP. The 64-bit version gives you more address space (removing the 2 GB (or so) address-space limit per application), but it probably does nothing for the file size problem.

Answer 4

Answered by Anton Gogolev

You cannot possibly use more than one core to calculate the MD5 hash of a large file because of the very nature of MD5: it expects a message to be broken up into chunks and fed into the hashing function in strict sequence. However, you can use one thread to read the file into an internal queue and calculate the hash in a separate thread, so that reading and hashing overlap. I do not think, though, that this will give you any significant performance boost.
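
The reader-thread-plus-queue arrangement could look roughly like this in Python 3. It is a sketch under assumptions: the function name `md5_with_reader_thread`, the chunk size, and the queue depth are all illustrative. The chunks still reach MD5 in strict file order, so the result is the ordinary MD5 of the file.

```python
import hashlib
import queue
import threading


def md5_with_reader_thread(path, chunk_size=1024 * 1024, max_queued=8):
    """MD5 of a file, with one thread reading and the caller hashing."""
    chunks = queue.Queue(maxsize=max_queued)  # bounded, to cap memory use

    def reader(f):
        try:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                chunks.put(chunk)
        finally:
            # None marks the end of the stream, even if reading failed.
            # A real implementation should also report the read error.
            chunks.put(None)

    md5 = hashlib.md5()
    with open(path, "rb") as f:
        worker = threading.Thread(target=reader, args=(f,))
        worker.start()
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            md5.update(chunk)  # chunks arrive in file order
        worker.join()
    return md5.hexdigest()
```

Whether this beats a plain read loop depends on whether the disk or the hash is the bottleneck, so it is worth measuring both before adopting it.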

The fact that it takes so long to process a big file might be due to "unbuffered" reads. Try reading, say, 16 KB at a time and then feeding the content in chunks to the hashing function.
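
A minimal sketch of chunked reading with `hashlib`, assuming a helper named `md5_of_file` (a made-up name) and a 1 MB default chunk size; the 16 KB suggested above works the same way, and the best size is something to measure on the target machine.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read until it returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Calling `md5.update` repeatedly gives the same digest as hashing the whole content in one call, so memory use stays at one chunk regardless of file size.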

Answer 5

Answered by Douglas Leeder

MD5 itself can't be run in parallel. However, you can MD5 the file in sections (in parallel) and then take an MD5 of the list of hashes.
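
One way to sketch the "hash the sections in parallel, then hash the list of hashes" idea in Python 3 is shown below. The names `section_md5` and `md5_of_sections`, the 64 MB section size, and the worker count are assumptions for illustration. Threads are used because `hashlib` releases the GIL while hashing large buffers. The result is not the plain MD5 of the file, so both ends must use the same section size.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def section_md5(path, offset, length, chunk_size=1024 * 1024):
    """Raw MD5 digest of `length` bytes of the file starting at `offset`."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:  # each task uses its own file handle
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # the last section may be shorter than `length`
            md5.update(chunk)
            remaining -= len(chunk)
    return md5.digest()


def md5_of_sections(path, section_size=64 * 1024 * 1024, workers=4):
    """MD5 over the concatenated per-section MD5 digests, in file order."""
    size = os.path.getsize(path)
    offsets = range(0, size, section_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so digests stay in file order.
        digests = list(pool.map(
            lambda off: section_md5(path, off, section_size), offsets))
    return hashlib.md5(b"".join(digests)).hexdigest()
```

If the disk is the bottleneck, several threads seeking around the same file may be slower than one sequential pass, so this is worth benchmarking rather than assuming.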

However, that assumes that the hashing is not I/O-limited, which I suspect it is. As Anton Gogolev suggests, make sure that you're reading the file efficiently (in large power-of-2 chunks). Once you've done that, make sure the file isn't fragmented.

Also, for new projects, a hash such as SHA-256 should be selected rather than MD5.

Are the zlib checksums much faster than MD5 for 4 GB files?

Answer 6

Answered by Brian

Did you try the crc-generator module?
