Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1303021/


Shortest hash in python to name cache files

Tags: python, hash

Asked by u0b34a0f6ae

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have a unique repr() so it can be used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack a cache for that object (and will have to reindex that object's data, a minor cost for the application).

So, if there is one collision we lose one cache file, but the collective savings from caching all objects makes application startup much faster, so it does not matter much.

Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?

Edit: The use case is like this: The data loader gets a new instance of a data-carrying object. Unique types have a unique repr, so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match, I notice it. So if we don't have a cache or have a false match, I instead init obj (reloading its data).
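
A minimal sketch of that scheme, in the Python 2 style used elsewhere on this page; CACHE_DIR, load_or_rebuild and reload_data are made-up names, and comparing repr strings is just one possible way to notice a false match:

import os
import cPickle as pickle

CACHE_DIR = "cache"  # hypothetical cache directory, assumed to exist

def cache_path(obj):
    # abs() because the builtin hash() can be negative
    return os.path.join(CACHE_DIR, str(abs(hash(repr(obj)))))

def load_or_rebuild(obj):
    path = cache_path(obj)
    if os.path.exists(path):
        with open(path, "rb") as f:
            cached = pickle.load(f)
        if repr(cached) == repr(obj):   # guard against a hash collision
            return cached
    obj.reload_data()                   # cache miss or false match: rebuild
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return obj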

Conclusions (?)

The str hash in python may be good enough; I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.

I found out how to take a hex hash (from any hash source) and store it compactly with base64:

import base64

# 'h' is a string of hex digits; pack each pair of digits into one raw byte
raw = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
# filename-safe base64 of the raw bytes, with the '=' padding stripped
hashstr = base64.urlsafe_b64encode(raw).rstrip("=")
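
The packing loop above does the same job as binascii.unhexlify; here is a sketch of the same idea that also runs on Python 3 (hex_to_short_name is just a made-up helper name):

import base64
import binascii

def hex_to_short_name(h):
    # hex digits -> raw bytes -> filename-safe base64, padding stripped
    raw = binascii.unhexlify(h)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")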

Answered by Alex Martelli

The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The wikipedia entry I've pointed to gives the exact formula). So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?

Since you mention that a collision is a minor annoyance, I recommend you aim for a hash whose number of possible values is roughly the square of the number of objects you'll have, or a bit less but not MUCH less than that.
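
As a rough check on that advice, the standard birthday approximation P ~= 1 - exp(-n**2 / (2*N)) can be evaluated directly (a small sketch; the figures in the comments come from this formula):

import math

def collision_probability(n_objects, hash_bits):
    # birthday approximation: P ~= 1 - exp(-n**2 / (2 * N)), with N = 2**hash_bits
    n_values = 2.0 ** hash_bits
    return 1.0 - math.exp(-(n_objects ** 2) / (2.0 * n_values))

# collision_probability(2 ** 16, 32) -> ~0.39    (64K objects, 32-bit hash)
# collision_probability(2 ** 16, 64) -> ~1.2e-10 (same objects, 64-bit hash)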

You want to make a filename - is that on a case-sensitive filesystem, as typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you aim for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically on case-sensitive vs insensitive systems.

On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. base64.urlsafe_b64encode, as avoiding '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char in hex.

Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.
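
To see what that buys in practice, here is the same 16-byte md5 digest encoded three ways (a small sketch; the lengths in the comments are with the '=' padding stripped):

import base64
import binascii
import hashlib

digest = hashlib.md5(b"some repr() string").digest()       # 16 raw bytes

hex_name = binascii.hexlify(digest)                        # 32 chars, 4 bits/char
b64_name = base64.urlsafe_b64encode(digest).rstrip(b"=")   # 22 chars, 6 bits/char
b32_name = base64.b32encode(digest).rstrip(b"=")           # 26 chars, 5 bits/char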

These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.
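
For instance, a sketch of turning the builtin hash() into a short name this way (short_name is a made-up helper; the mask forces the value into unsigned 32-bit range before packing):

import base64
import struct

def short_name(obj):
    h = hash(repr(obj)) & 0xffffffff                       # unsigned 32-bit value
    packed = struct.pack(">I", h)                          # 4 raw bytes, big-endian
    return base64.urlsafe_b64encode(packed).rstrip(b"=")   # 6 characters

# short_name((1, 2, 3)) -> a 6-character, filename-safe string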

If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate ;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.
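
A sketch of that folding idea at a smaller size, aiming for 40 bits (folded_hash is a made-up name; here the digest is xor-folded down to 64 bits first and then cut with %):

import base64
import hashlib
import struct

def folded_hash(obj, bits=40):
    digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()   # 32 bytes
    folded = 0
    for i in range(0, len(digest), 8):
        # xor 8-byte chunks together instead of truncating the digest
        folded ^= struct.unpack(">Q", digest[i:i + 8])[0]
    folded %= 2 ** bits                                       # cut to the desired length
    packed = struct.pack(">Q", folded)[-((bits + 7) // 8):]   # keep only the needed bytes
    return base64.urlsafe_b64encode(packed).rstrip(b"=")      # ~7 characters for 40 bits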

Answered by Martin v. Löwis

The builtin hash function of strings is fairly collision free, and also fairly short. It has 2**32 values, so it is fairly unlikely that you encounter collisions (if you use its abs value, it will have only 2**31 values).

You have been asking for the shortest hash function. That would certainly be

def hash(s):
  return 0

but I guess you didn't really mean it that way...

Answered by Ned Batchelder

You can make any hash you like shorter by simply truncating it. md5 is always 32 hex digits, but an arbitrary substring of it (or any other hash) has the proper qualities of a hash: equal values produce equal hashes, and the values are spread around a bunch.
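
For example (a sketch; short_md5 is a made-up helper and eight characters is an arbitrary choice):

import hashlib

def short_md5(obj, length=8):
    # any prefix of the full 32-character hexdigest is itself a usable hash
    return hashlib.md5(repr(obj).encode("utf-8")).hexdigest()[:length]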

Answered by Matthew Scharley

I'm sure that there's a CRC32 implementation in Python, but that may be too short (8 hex digits). On the upside, it's very quick.

Found it, binascii.crc32
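
A small sketch of using it for a filename (crc32_name is a made-up helper; masking gives the same unsigned value on every Python version and platform, since older versions return a signed result):

import binascii

def crc32_name(obj):
    checksum = binascii.crc32(repr(obj).encode("utf-8")) & 0xffffffff
    return "%08x" % checksum        # always exactly 8 hex digits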

Answered by gahooa

If you do have a collision, how are you going to tell that it actually happened?

If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).

Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).

Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.

Answered by ThomasH

We use hashlib.sha1.hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.

Answered by mhawke

Considering your use case, if you don't have your heart set on using separate cache files and you are not too far down that development path, you might consider using the shelve module.

This will give you a persistent dictionary (stored in a single dbm file) in which you store your objects. Pickling/unpickling is performed transparently, and you don't have to concern yourself with hashing, collisions, file I/O, etc.

For the shelve dictionary keys, you would just use repr(obj) and let shelve deal with stashing your objects for you. A simple example:

import shelve

# keys are repr() strings; pickling/unpickling happens transparently (Python 2 syntax)
cache = shelve.open('cache')
t = (1, 2, 3)
i = 10
cache[repr(t)] = t
cache[repr(i)] = i
print cache
# {'(1, 2, 3)': (1, 2, 3), '10': 10}
cache.close()

# reopening the shelf gives the same contents back
cache = shelve.open('cache')
print cache
# {'(1, 2, 3)': (1, 2, 3), '10': 10}
print cache[repr(10)]
# 10

Answered by Havenard

Short hashes mean you may have the same hash for two different files. The same may happen for big hashes too, but it's way more rare. Maybe these file names should also vary based on other references, like microtime (unless these files may be created too quickly).
