Java Trie vs. 后缀树 vs. 后缀数组
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2487576/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Trie vs. suffix tree vs. suffix array
提问by David Campos
Which structure provides the best performance results; trie (prefix tree), suffix tree or suffix array? Are there other similar structures? What are good Java implementations of these structures?
哪种结构提供了最好的性能结果;trie(前缀树)、后缀树还是后缀数组?还有其他类似的结构吗?这些结构有哪些好的 Java 实现?
Edit: in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.
编辑:在这种情况下,我想在大型名称词典和大型自然语言文本集之间进行字符串匹配,以识别文本词典的名称。
采纳答案by Miguel Figueiredo
The trie was the first data structure of this kind discovered.
trie 是第一个发现的此类数据结构。
The suffix tree is an improvement over the trie (it has suffix links which allow linear error search, the suffix tree trims unnecessary branches of the trie therefore it does not require as much space).
后缀树是对 trie 的改进(它具有允许线性错误搜索的后缀链接,后缀树修剪了 trie 的不必要分支,因此它不需要那么多空间)。
The suffix array is a stripped down data structure based on the suffix tree (no suffix links (slow error matches), yet pattern matching is very fast).
后缀数组是基于后缀树的精简数据结构(没有后缀链接(慢速错误匹配),但模式匹配非常快)。
The trie is not for real world use because it consumes too much space.
trie 不适用于现实世界,因为它占用了太多空间。
The suffix tree is lighter and faster than the trie and is used to index DNA or optimize some large web search engines.
后缀树比 trie 更轻、更快,用于索引 DNA 或优化一些大型网络搜索引擎。
The suffix array is slower in some pattern searches than the suffix tree but uses less space, and is more widely used than the Suffix tree.
后缀数组在某些模式搜索中比后缀树慢,但使用的空间更少,并且比后缀树使用更广泛。
In the same family of data structures:
在同一族数据结构中:
There are other implementations, the CST is an implementation of the suffix tree using a suffix array and some additional data structures to get some of the suffix tree search capabilities.
还有其他的实现,CST是使用后缀数组和一些额外的数据结构来实现一些后缀树搜索能力的后缀树的实现。
The FCST takes it further, it implements a sampled suffix tree with a suffix array.
FCST 更进一步,它实现了一个带有后缀数组的采样后缀树。
The DFCST is a dynamic version of the FCST.
DFCST 是 FCST 的动态版本。
Expanding:
扩展:
The two important factors are space use and operation execution time. You might think that with modern day machines this is not relevant but to index the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree). And to build one of this indexes over this much data can take days. Imagine Google, it has lots of searchable data, they need a large index over all web data and they do not change it every time someone builds a web page. They have some form of caching for that. However the main index is probably static. And every couple of weeks or so they gather all new web sites and data and build a new index, replacing the old when the new is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.
两个重要的因素是空间使用和操作执行时间。您可能认为对于现代机器,这无关紧要,但索引一个人的 DNA 需要 40 GB 的内存(使用未压缩和未优化的后缀树)。并且在这么多数据上构建其中一个索引可能需要数天时间。想象一下谷歌,它有很多可搜索的数据,他们需要一个涵盖所有网络数据的大索引,而且每次有人构建网页时他们都不会改变它。他们有某种形式的缓存。然而,主索引可能是静态的。每隔几周左右,他们就会收集所有新的网站和数据并建立一个新的索引,在新的完成后替换旧的。我不知道他们使用哪种算法来索引,但它可能是分区数据库上具有后缀树属性的后缀数组。
The CST uses 8 gigabytes, however the suffix tree operations speed are heavily reduced.
CST 使用 8 GB,但是后缀树操作速度大大降低。
The suffix array can do the same in some 700 megas to 2 Gigas. However you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much much slower).
后缀阵列可以在大约 700 兆到 2 兆的范围内执行相同的操作。但是,您不会在带有后缀数组的 DNA 中发现遗传错误(意思是:搜索带有通配符的模式要慢得多)。
The FCST (fully compressed suffix tree) can create a suffix tree in 800 to 1.5 gigas. With a rather small speed deterioration towards the CST.
FCST(全压缩后缀树)可以创建一个800到1.5千兆的后缀树。以相当小的速度向 CST 恶化。
The DFCST uses 20% more space than the FCST, and loses speed to the static implementation of the FCST (however a dynamic index is very important).
DFCST 使用的空间比 FCST 多 20%,并且比 FCST 的静态实现速度慢(但是动态索引非常重要)。
There are not many viable (space wise) implementations of the suffix tree because it is very hard to make the operations speed boost compensate the data structures RAM space cost.
后缀树的可行(空间明智)实现并不多,因为很难使操作速度提升补偿数据结构 RAM 空间成本。
This said, the suffix tree has very interesting search results for pattern matching with errors. The aho corasick is not as fast (though nearly as fast for some operations, not error matching) and the boyer moore is left in the dust.
这就是说,后缀树对于带有错误的模式匹配具有非常有趣的搜索结果。aho corasick 没有那么快(尽管对于某些操作来说几乎一样快,而不是错误匹配)并且 boyer moore 被留在了尘土中。
回答by swestrup
Using Suffix Trees you can write something that will match your dictionary to your text in O(n+m+k) time where n is letters in your dictionary, m is letters in your text, and k is the number of matches. Tries are much slower for this. I'm not sure what a Suffix Array is, so I can't comment on that.
使用后缀树,您可以在 O(n+m+k) 时间内编写一些将字典与文本匹配的内容,其中 n 是字典中的字母,m 是文本中的字母,k 是匹配的数量。为此,尝试要慢得多。我不确定后缀数组是什么,所以我不能对此发表评论。
That said, it's non-trivial to code and I don't happen to know of any Java libraries that provide the necessary functions.
也就是说,编写代码很重要,而且我不知道有任何提供必要功能的 Java 库。
回答by Matthew Slattery
EDIT: In this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.
编辑:在这种情况下,我想在大型名称词典和大型自然语言文本集之间进行字符串匹配,以识别文本词典的名称。
This sounds like an application for the Aho-Corasick algorithm: construct an automaton from the dictionary (in linear time), which can then be used to find all the occurrences of any of the dictionary words in multiple texts (also in linear time).
这听起来像是Aho-Corasick 算法的应用程序:从字典(在线性时间内)构造一个自动机,然后可以使用它来查找多个文本中任何字典单词的所有出现次数(也在线性时间内)。
(The description in these lecture notes, linked from the "External links" section of the Wikipedia page, is a lot easier to read than the description on the page itself.)
(这些讲义中的描述,从维基百科页面的“外部链接”部分链接,比页面本身的描述更容易阅读。)
回答by Chad Brewbaker
What operations do you plan on doing? libdivsufsortwas at one time the best suffix array implementation in C.
您计划进行哪些操作? libdivsufsort曾经是 C 中最好的后缀数组实现。
回答by hexcoder
回答by ibra
Trie vs Suffix tree
Trie vs 后缀树
both data structure ensure a very fast look up, the time of search is proportional to the lenght of the query word, complexity time O(m) where m is the lenght of the query word.
两种数据结构都确保了非常快速的查找,搜索时间与查询词的长度成正比,复杂度时间为 O(m),其中 m 是查询词的长度。
it's mean if we have query word that have 10 chars, so we need at most 10 steps to find it.
这意味着如果我们有一个有 10 个字符的查询词,那么我们最多需要 10 个步骤才能找到它。
Trie: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
Trie:一种用于存储字符串的树,其中每个公共前缀都有一个节点。字符串存储在额外的叶节点中。
suffix tree: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
后缀树:与给定字符串的后缀对应的 trie 的紧凑表示,其中具有一个子节点的所有节点都与其父节点合并。
def are from : Dictionary of Algorithms and Data Structures
def来自:算法和数据结构词典
generally Trie used to index dictionary words (lexicon) or any sets of strings example D={abcd, abcdd, bxcdf,.....,zzzz }
通常 Trie 用于索引字典单词(lexicon)或任何字符串集,例如 D={abcd, abcdd, bxcdf,.....,zzzz }
a suffix tree used to index text by using the same data structure "Trie" on all suffixes of our text T=abcdabcg all suffixes of T = {abcdabcg , abcdabc , abcdab, abcda, abcd, abc , ab, a}
一个后缀树,用于通过在我们文本的所有后缀上使用相同的数据结构“Trie”来索引文本 T=abcdabcg T = {abcdabcg , abcdabc , abcdab, abcda, abcd, abc , ab, a} 的所有后缀
now it look like a groups of strings. we build a Trie over over this groups of strings (all suffixes of T).
现在它看起来像一组字符串。我们在这组字符串(T 的所有后缀)上构建了一个 Trie。
the construction of both data structure is in linear, it take O(n) in time and space.
两种数据结构的构建都是线性的,时间和空间都需要O(n)。
in case of dicionary (a set of strings): n = the sum of the characters of all the words. in text : n = length of text.
如果是字典(一组字符串):n = 所有单词的字符总和。在文本中:n = 文本长度。
suffix array : is a technic to represent a suffix tree in compressed sapce, it's an array of all starting positions of suffixes of a string.
后缀数组:是一种在压缩空间中表示后缀树的技术,它是一个字符串后缀的所有起始位置的数组。
it's slower than suffix tree in search time.
它在搜索时间上比后缀树慢。
for more information go to wikipedia , there is a good article talking on this topic.
有关更多信息,请访问维基百科,有一篇关于此主题的好文章。
回答by Fogsail Chen
I prefer Suffix Auto Machine. You can find more details through my website: http://www.fogsail.net/2019/03/06/20190306/
我更喜欢后缀自动机。你可以通过我的网站找到更多细节:http: //www.fogsail.net/2019/03/06/20190306/
first, If you used normal construction, it will takes O(n^2) to travel all the suffix
首先,如果您使用正常构造,则需要 O(n^2) 来遍历所有后缀
We use radix-sort to sort the suffix Array by first character.
我们使用 radix-sort 按第一个字符对后缀 Array 进行排序。
But, if we sort the first character, we can use the information.
但是,如果我们对第一个字符进行排序,我们就可以使用这些信息。
Details are showed by the images (neglect Chinese)
细节由图片展示(忽略中文)
We sort array by the first-key, the result is presented by the red rectangle
我们按第一个键对数组进行排序,结果由红色矩形表示
,,....??>,,....
,,....??>,,....
#include <iostream>
#include <cstdio>
#include <vector>
#include <queue>
#include <cstring>
#include <algorithm>
using namespace std;
const int maxn = 1001 * 100 + 10;
struct suffixArray {
int str[maxn], sa[maxn], rank[maxn], lcp[maxn];
int c[maxn], t1[maxn], t2[maxn];
int n;
void init() { n = 0; memset(sa, 0, sizeof(sa)); }
void buildSa(int Rdx) {
int i, *x = t1, *y = t2;
for(i = 0; i < Rdx; i++) c[i] = 0;
for(i = 0; i < n; i++) c[x[i] = str[i]]++;
for(i = 1; i < Rdx; i++) c[i] += c[i-1];
for(i = n-1; i >= 0; i--) sa[--c[x[i]]] = i;
for(int offset = 1; offset <= n; offset <<= 1) {
int p = 0;
for(i = n-offset; i < n; i++) y[p++] = i;
for(i = 0; i < n; i++) if(sa[i] >= offset) y[p++] = sa[i] - offset;
// radix sort
for(i = 0; i < Rdx; i++) c[i] = 0;
for(i = 0; i < n; i++) c[x[y[i]]]++;
for(i = 1; i < Rdx; i++) c[i] += c[i-1];
for(i = n-1; i >= 0; i--) { sa[--c[x[y[i]]]] = y[i]; y[i] = 0; }
// rebuild x and y
swap(x, y); x[sa[0]] = 0; p = 1;
for(i = 1; i < n; i++)
x[sa[i]] = y[sa[i-1]] == y[sa[i]] && y[sa[i-1]+offset] == y[sa[i]+offset] ? p-1 : p++;
if(p >= n) break;
Rdx = p;
}
}