Trie (Prefix Tree) in Python
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/960963/
Asked by jacob
I don't know if this is the place to ask about algorithms. But let's see if I get any answers ... :)
If anything is unclear I'm very happy to clarify things.
I just implemented a Trie in Python. However, one part seemed more complicated than it ought to be (speaking as someone who loves simplicity). Perhaps someone has had a similar problem?
My aim was to minimize the number of nodes by storing the largest common prefix of a sub-trie in its root. For example, if we had the words stackoverflow, stackbase and stackbased, then the tree would look something like this:
              [s]tack
         ______/ \_______
        /                \
  [o]verflow           [b]ase
                           \___ [d]
Note that one can still think of the edges as having one character (the first character of the child node).
The Find query is simple to implement. Insertion is not hard, but somewhat more complex than I'd like... :(
My idea was to insert the keys one after the other (starting from an empty trie), by first searching for the to-be-inserted key k (Find(k)), and then rearranging/splitting the nodes locally at the place where the find procedure stops. There turn out to be four cases (let k be the key we want to insert, and k' the key of the node where the search ended):
- k is identical to k'
- k is a "proper" prefix of k'
- k' is a "proper" prefix of k
- k and k' share some common prefix, but none of the cases (1), (2) or (3) occur.
It seems that each of these cases is distinct and thus implies a different modification of the trie. BUT: is it really that complex? Am I missing something? Is there a better approach?
Thanks :)
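For concreteness, the four cases can be told apart by comparing k against the key of the node where the search stopped. A minimal sketch (the function names are mine, not from the original post):

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def classify(k, k2):
    """Return which of the four insertion cases applies (k2 plays the role of k')."""
    p = common_prefix_len(k, k2)
    if k == k2:
        return 1   # identical
    if p == len(k):
        return 2   # k is a proper prefix of k'
    if p == len(k2):
        return 3   # k' is a proper prefix of k
    return 4       # they diverge after some (possibly empty) common prefix
```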
Answered by Jason Watkins
At a glance, it sounds like you've implemented a Patricia trie. This approach is also called path compression in some of the literature. There should be copies of the original PATRICIA paper that aren't behind the ACM paywall, which will include an insertion algorithm.
There's also another compression method you may want to look at: level compression. The idea behind path compression is to replace strings of single child nodes with a single super node that has a "skip" count. The idea behind level compression is to replace full or nearly full subtrees with a super node with a "degree" count that says how many digits of the key the node decodes. There's also a 3rd approach called width compression, but I'm afraid my memory fails me and I couldn't find a description of it with quick googling.
Level compression can shorten the average path considerably, but insertion and removal algorithms get quite complicated, as they need to manage the trie nodes similarly to dynamic arrays. For the right data sets, level-compressed trees can be fast. From what I remember, they're the second-fastest approach for storing IP routing tables; the fastest is some sort of hash trie.
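To illustrate the path-compression idea described above, here is a sketch of the lookup side of a bit-level Patricia trie: internal nodes test a single bit position, intervening bits are skipped, and a final full comparison at the leaf is therefore required. The Branch/Leaf classes and the bit_of helper are my own illustration, not from the answer:

```python
class Leaf:
    def __init__(self, key):
        self.key = key          # full key stored at the leaf

class Branch:
    def __init__(self, bit, left, right):
        self.bit = bit          # bit index to test; bits before it are skipped
        self.left = left        # subtree where that bit is 0
        self.right = right      # subtree where that bit is 1

def bit_of(key, i):
    """i-th bit (MSB first) of the key's UTF-8 bytes; 0 past the end."""
    byte, off = divmod(i, 8)
    data = key.encode()
    return (data[byte] >> (7 - off)) & 1 if byte < len(data) else 0

def lookup(node, key):
    while isinstance(node, Branch):
        node = node.right if bit_of(key, node.bit) else node.left
    # Path compression skipped the intermediate bits, so we must
    # verify the whole key against the leaf before reporting a hit.
    return node.key == key
```

For example, "a" (0x61) and "b" (0x62) first differ at bit 6, so a tree holding both is a single Branch testing that bit.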
Answered by SingleNegationElimination
I don't see anything wrong with your approach. If you're looking for a spike solution, perhaps the action taken in case 4 is actually feasible for the first three cases as well, i.e. find the common prefix of k and k' and rebuild the node with that in mind. If the keys happen to be prefixes of one another, the resulting trie will still be correct; the implementation just does a bit more work than it strictly has to. Then again, without any code to look at, it's hard to say whether this works in your case.
Answered by Joe Beda
Somewhat of a tangent, but if you are super worried about the number of nodes in your Trie, you may look at joining your word suffixes too. I'd take a look at the DAWG (Directed Acyclic Word Graph) idea: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
The downside of these is that they aren't very dynamic and creating them can be difficult. But, if your dictionary is static, they can be super compact.
Answered by Ritesh M Nayak
I have a question regarding your implementation. What level of granularity do you split your strings on to build the prefix tree? You could split stack as s, t, a, c, k or as st, ta, ac, ck, or many other n-grams of it. Most prefix-tree implementations take into account an alphabet for the language, and you do the splitting based on this alphabet.
If you were building a prefix-tree implementation for Python code, then your alphabet would consist of tokens like def, :, if, else, etc.
Choosing the right alphabet makes a huge difference in building efficient prefix trees. As for answers to your question, you could look for Perl packages on CPAN that do longest-common-substring computation using tries. You may have some luck there, as most of their implementations are pretty robust.
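For illustration only, the two splittings mentioned above ("s, t, a, c, k" versus "st, ta, ac, ck") can be generated like this:

```python
def chars(s):
    """Character-level alphabet: one symbol per character."""
    return list(s)

def ngrams(s, n=2):
    """Overlapping n-grams, as in the 'st, ta, ac, ck' example."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```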
Answered by Ritesh M Nayak
Look at Judy arrays and the Python interface at http://www.dalkescientific.com/Python/PyJudy.html