需要内存高效的方式来存储大量字符串(是:Java 中的 HAT-Trie 实现)

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/2218905/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-29 20:07:27  来源:igfitidea点击:

Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

javadata-structureshashtriebloom-filter

提问by hashable

I am working with a large set (5-20 million)of String keys (average length 10 chars)which I need to store in an in memory data structure that supports the following operation in constant time or near constant time:

我正在使用大量(5-20​​00 万)字符串键(平均长度为 10 个字符),我需要将其存储在内存数据结构中,该结构支持恒定时间或接近恒定时间的以下操作:

// Returns true if the input is present in the container, false otherwise
public boolean contains(String input)

Java's Hashmap is proving to be more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory efficient and still supports a throughput that is decent (comparable with or nearly as good as hashing).

事实证明,就吞吐量而言,Java 的 Hashmap 非常令人满意,但占用了大量内存。我正在寻找一种内存高效的解决方案,并且仍然支持不错的吞吐量(与散列相比或几乎与散列一样好)。

I don't care about the insertion/deletion times. In my application, I will be performing only insertions (only at startup time) and will subsequently only be querying the data structure using the containsmethod for the life of the application.

我不在乎插入/删除时间。在我的应用程序中,我将仅执行插入(仅在启动时),随后将仅contains在应用程序的生命周期中使用该方法查询数据结构。

I read that the HAT-Trie data structure is closest for my needs. I am wondering if there is a library that has an implementation.

我读到 HAT-Trie 数据结构最适合我的需要。我想知道是否有一个具有实现的库。

Other suggestions with pointers to implementations welcome.

欢迎其他带有实现指针的建议。

Thank You.

谢谢。

采纳答案by Pascal Cuoq

The trie seems like a very good idea for your constraints.

对于您的约束,trie 似乎是一个非常好的主意。

A "thinking outside the box" alternative:

“跳出框框思考”的替代方案:

If you can afford some probability of answering "present" for a string that is absent

如果您有能力为不存在的字符串回答“存在”

EDIT: if you can afford false positives, use a Bloom filteras suggested by WizardOfOdds in the comments.

编辑:如果您能承受误报,请按照 WizardOfOdds 在评论中的建议使用布隆过滤器

For k=1, a Bloom filter is like a hash table without the keys: each "bucket" is simply a boolean that tells if at least one input with the same hash was present. If 1% false positives is acceptable, your hash table can be as small as about 100 * 20 million bits or roughly 200 MiB. For 1 in 1000 false positives, 2GiB.

对于 k=1,布隆过滤器就像一个没有键的哈希表:每个“桶”只是一个布尔值,它告诉是否存在至少一个具有相同哈希值的输入。如果 1% 的误报是可以接受的,那么您的哈希表可以小到大约 100 * 2000 万位或大约 200 MiB。对于千分之一的误报,2GiB。

Using several hash functions instead of one can improve the false positive rate for the same amount of bits.

使用多个散列函数而不是一个可以提高相同位数的误报率。

回答by Darius Bacon

Google brings up a blog post on HAT tries in Java. But I don't see how this will solve your problem directly: the structure is a shallow trie over prefixes of the keys, with the leaves being hashtables holding the suffixes of all keys with the given prefix. So in total, you have a lot of hashtables storing all of the keys that are in your current one big hashtable (perhaps saving a few bytes per key overall because of the common prefixes). Either way, you need a more space-efficient hashtable than the default Java one, or the per-object overhead will hit you just as badly. So why not start with a specialized hashtable class for string keys only, if you take this route, and worry about the trie part only if it still seems worthwhile then?

Google 发布了一篇关于在 Java 中尝试 HAT的博客文章。但是我不知道这将如何直接解决您的问题:结构是对键前缀的浅层树,叶子是散列表,其中包含具有给定前缀的所有键的后缀。所以总的来说,你有很多哈希表存储了当前一个大哈希表中的所有键(由于公共前缀,每个键可能总共节省了几个字节)。无论哪种方式,您都需要一个比默认 Java 哈希表更节省空间的哈希表,否则每个对象的开销也会同样严重。那么为什么不从一个专门用于字符串键的哈希表类开始,如果你走这条路,并且只在它看起来仍然值得时才担心 trie 部分呢?

回答by Justin Peel

Similar to a trie is a ternary search tree, but a ternary search tree has the advantage of using less memory. You can read about ternary search trees here, here, and here. Also one of the main papers on the subject by Jon Bentley and Robert Sedgewick is here. It also talks about sorting strings quickly, so don't be put off by that.

与 trie 类似的是三元搜索树,但三元搜索树具有使用较少内存的优点。您可以在此处此处此处阅读有关三元搜索树的信息。Jon Bentley 和 Robert Sedgewick 关于该主题的主要论文之一也在这里。它还谈到了快速排序字符串,所以不要被它拖延。

回答by Darius Bacon

For space efficiency, O(log(n)) lookup, and simple code, try binary search over an array of characters. 20 million keys of average length 10 makes 200 million characters: 400MB if you need 2 bytes/char; 200MB if you can get away with 1. On top of this you need to somehow represent the boundaries between the keys in the array. If you can reserve a separator character, that's one way; otherwise you might use a parallel array of int offsets.

为了空间效率、O(log(n)) 查找和简单代码,请尝试对字符数组进行二分搜索。平均长度为 10 的 2000 万个密钥可以生成 2 亿个字符:如果需要 2 个字节/字符,则为 400MB;200MB,如果你能逃脱 1。在此之上,你需要以某种方式表示数组中键之间的边界。如果您可以保留分隔符,那是一种方法;否则,您可能会使用 int 偏移量的并行数组。

The simplest variant would use an array of Strings, at a high space cost from per-object overhead. It ought to still beat a hashtable in space efficiency, though not as impressively.

最简单的变体将使用字符串数组,但每个对象的开销都需要很高的空间成本。它应该仍然在空间效率方面击败哈希表,尽管没有那么令人印象深刻。