string Ukkonen 的简单英文后缀树算法

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9452701/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-09 01:24:38  来源:igfitidea点击:

Ukkonen's suffix tree algorithm in plain English

stringalgorithmlanguage-agnosticsuffix-tree

提问by Nathan Ridley

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

这时候我觉得有点厚。我花了几天的时间试图完全了解后缀树的构建,但是因为我没有数学背景,许多解释都让我无法理解,因为他们开始过度使用数学符号。我发现的最接近一个好的解释是Fast String Searching With Suffix Trees,但他掩盖了各个点,算法的某些方面仍然不清楚。

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

我敢肯定,除了我之外,Stack Overflow 上对这个算法的逐步解释对于许多其他人来说都是无价的。

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

作为参考,这里是 Ukkonen 关于算法的论文:http: //www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

My basic understanding, so far:

到目前为止,我的基本理解:

  • I need to iterate through each prefix P of a given string T
  • I need to iterate through each suffix S in prefix P and add that to tree
  • To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.
  • 我需要遍历给定字符串 T 的每个前缀 P
  • 我需要遍历前缀 P 中的每个后缀 S 并将其添加到树中
  • 要将后缀 S 添加到树中,我需要遍历 S 中的每个字符,迭代包括沿着以 S 中相同字符集 C 开始的现有分支走下去,并可能在我将边拆分为后代节点时到达后缀中的不同字符,或者如果没有匹配的边缘走下去。当没有找到匹配 C 的边时,会为 C 创建一个新的叶边。

The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think thatis what I'm having trouble understanding.

基本算法似乎是 O(n 2),正如大多数解释中所指出的,因为我们需要遍历所有前缀,然后我们需要遍历每个前缀的每个后缀。Ukkonen 的算法显然是独一无二的,因为他使用了后缀指针技术,尽管我认为是我难以理解的。

I'm also having trouble understanding:

我也很难理解:

  • exactly when and how the "active point" is assigned, used and changed
  • what is going on with the canonization aspect of the algorithm
  • Why the implementations I've seen need to "fix" bounding variables that they are using
  • 确切地何时以及如何分配、使用和更改“活动点”
  • 算法的规范化方面发生了什么
  • 为什么我看到的实现需要“修复”他们正在使用的边界变量


Here is the completed C#source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

这是完整的C#源代码。它不仅工作正常,而且支持自动规范化并呈现出更好看的输出文本图。源代码和示例输出位于:

https://gist.github.com/2373868

https://gist.github.com/2373868



Update 2017-11-04

更新 2017-11-04

After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalkfrom the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.

多年后,我发现了后缀树的新用途,并在JavaScript 中实现了该算法。要点如下。它应该是没有错误的。npm install chalk从同一位置将其转储到 js 文件中,然后使用 node.js 运行以查看一些彩色输出。在同一个 Gist 中有一个精简版,没有任何调试代码。

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

回答by jogojapan

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

下面尝试描述 Ukkonen 算法,首先展示它在字符串简单(即不包含任何重复字符)时的作用,然后将其扩展到完整算法。

First, a few preliminary statements.

首先,做一些初步的陈述。

  1. What we are building, is basicallylike a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

  2. But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

  1. 我们正在构建的东西基本上就像一个搜索树。所以有一个根节点,从它出来的边通向新节点,还有更多的边从那些节点出来,依此类推

  2. 但是:与搜索树不同,边缘标签不是单个字符。相反,每条边都使用一对整数标记 [from,to]。这些是指向文本的指针。从这个意义上说,每条边都带有一个任意长度的字符串标签,但只占用 O(1) 空间(两个指针)。

Basic principle

基本原则

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

我想先演示一下如何创建一个特别简单的字符串的后缀树,一个没有重复字符的字符串:

abc

The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

该算法分步工作,从左到右。有一步到位的字符串的每个字符。每个步骤可能涉及多个单独的操作,但我们将看到(见最后的最终观察)操作总数为 O(n)。

So, we start from the left, and first insert only the single character aby creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol #to mean the current end, which is at position 1 (right after a).

所以,我们从左边开始,首先a通过创建从根节点(在左边)到叶子的边,然后将其标记为[0,#],只插入单个字符 ,这意味着边表示从位置 0 开始到结束的子串在当前结束。我使用符号#来表示当前结束,它位于位置 1(紧接在 之后a)。

So we have an initial tree, which looks like this:

所以我们有一个初始树,它看起来像这样:

And what it means is this:

它的意思是这样的:

Now we progress to position 2 (right after b). Our goal at each stepis to insert all suffixes up to the current position. We do this by

现在我们前进到位置 2(就在 之后b)。我们在每一步的目标是插入到当前位置的所有后缀。我们这样做

  • expanding the existing a-edge to ab
  • inserting one new edge for b
  • 将现有的a边扩展到ab
  • 插入一条新边 b

In our representation this looks like

在我们的表示中,这看起来像

enter image description here

在此处输入图片说明

And what it means is:

它的意思是:

We observetwo things:

我们观察到两件事:

  • The edge representation for abis the sameas it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position #from 1 to 2.
  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.
  • 对于边缘的表示ab相同的,因为它使用的是在初始树:[0,#]。由于我们将当前位置#从 1更新为 2,它的含义已自动更改。
  • 每条边都消耗 O(1) 空间,因为它只包含两个指向文本的指针,而不管它代表多少个字符。

Next we increment the position again and update the tree by appending a cto every existing edge and inserting one new edge for the new suffix c.

接下来,我们再次增加位置并通过将 a 附加c到每个现有边并为新后缀插入一条新边来更新树c

In our representation this looks like

在我们的表示中,这看起来像

And what it means is:

它的意思是:

We observe:

我们观察到:

  • The tree is the correct suffix tree up to the current positionafter each step
  • There are as many steps as there are characters in the text
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.
  • 树是 每一步后到当前位置的正确后缀树
  • 步骤与文本中的字符一样多
  • 每一步的工作量是 O(1),因为所有现有的边都会通过递增自动更新#,并且可以在 O(1) 时间内为最终字符插入一条新边。因此对于长度为 n 的字符串,只需要 O(n) 时间。

First extension: Simple repetitions

第一个扩展:简单的重复

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

当然,这很有效只是因为我们的字符串不包含任何重复。我们现在看一个更现实的字符串:

abcabxabcd

It starts with abcas in the previous example, then abis repeated and followed by x, and then abcis repeated followed by d.

它以abc前面的示例中的开头,然后ab重复,然后是,然后是重复x,然后abcd

Steps 1 through 3:After the first 3 steps we have the tree from the previous example:

第 1 步到第 3 步在前 3 个步骤之后,我们得到了上一个示例中的树:

Step 4:We move #to position 4. This implicitly updates all existing edges to this:

第 4 步:我们移动#到位置 4。这将所有现有的边隐式更新为:

and we need to insert the final suffix of the current step, a, at the root.

并且我们需要a在根处插入当前步骤的最终后缀。

Before we do this, we introduce two more variables(in addition to #), which of course have been there all the time but we haven't used them so far:

在我们这样做之前,我们再引入两个变量(除了 #),它们当然一直存在,但到目前为止我们还没有使用它们:

  • The active point, which is a triple (active_node,active_edge,active_length)
  • The remainder, which is an integer indicating how many new suffixes we need to insert
  • 活性点,这是一个三 (active_node,active_edge,active_length)
  • remainder,这是表示我们有多少新后缀需要插入一个整数

The exact meaning of these two will become clear soon, but for now let's just say:

这两个的确切含义很快就会变得清晰,但现在让我们先说:

  • In the simple abcexample, the active point was always (root,'\0x',0), i.e. active_nodewas the root node, active_edgewas specified as the null character '\0x', and active_lengthwas zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainderwas always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).
  • 在这个简单的abc例子中,活动点总是 (root,'\0x',0),即active_node根节点,active_edge被指定为空字符'\0x',并且active_length为零。这样做的结果是,我们在每一步中插入的一条新边作为新创建的边插入到根节点。我们很快就会看到为什么需要一个三元组来表示这些信息。
  • remainder每一步的开始总是设置为1。这意味着我们必须在每一步结束时主动插入的后缀数为 1(始终只是最后一个字符)。

Now this is going to change. When we insert the current final character aat the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

现在这将改变。当我们a在根处插入当前的最后一个字符时,我们注意到已经有一个以 开头的输出边a,特别是:abca。在这种情况下,我们会这样做:

  • We do notinsert a fresh edge [4,#]at the root node. Instead we simply notice that the suffix ais already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active pointto (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only oneedge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.
  • 我们不会[4,#]在根节点插入新边。相反,我们只是注意到后缀a已经在我们的树中。它在较长边缘的中间结束,但我们并不担心。我们只是让事情保持原样。
  • 我们将活动点设置(root,'a',1)。这意味着活动点现在位于以 开头的根节点的出边中间的某个a位置,特别是在该边上的位置 1 之后。我们注意到边由它的第一个字符简单地指定a。这足够了,因为只能有一个以任何特定字符开头的边(在阅读整个描述后确认这是真的)。
  • 我们也增加了remainder,所以在下一步开始时它将是 2。

Observation:When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changedat all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current positionany more, but it containsall suffixes (because the final suffix ais contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no workdone in this step.

观察:发现我们需要插入的最终后缀已经存在于树中时,树本身根本没有改变(我们只更新活动点和remainder)。该树不再是直到当前位置的后缀树的准确表示,但它包含所有后缀(因为最终后缀a隐式包含)。因此,除了更新变量(都是固定长度的,所以这是 O(1)), 这一步没有任何工作

Step 5:We update the current position #to 5. This automatically updates the tree to this:

第 5 步:我们将当前位置更新#为 5。这会自动将树更新为:

And because remainderis 2, we need to insert two final suffixes of the current position: aband b. This is basically because:

并且因为remainder是 2,我们需要插入当前位置的两个最终后缀:abb。这主要是因为:

  • The asuffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from ato ab.
  • And we need to insert the new final edge b.
  • a上一步的后缀从未正确插入。所以它一直存在,自从我们进步了一步,它现在已经从aab
  • 我们需要插入新的最终边b

In practice this means that we go to the active point (which points to behind the aon what is now the abcabedge), and insert the current final character b. But:Again, it turns out that bis also already present on that same edge.

在实践中,这意味着我们转到活动点(指向a现在abcab边缘的后面),并插入当前的最终字符b但是:同样,事实证明b也已经存在于同一边缘。

So, again, we do not change the tree. We simply:

所以,同样,我们不改变树。我们简单地:

  • Update the active point to (root,'a',2)(same node and edge as before, but now we point to behind the b)
  • Increment the remainderto 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.
  • 将活动点更新为(root,'a',2)(与以前相同的节点和边,但现在我们指向 后面b
  • 将 theremainder增加到 3,因为我们仍然没有正确插入上一步中的最后一条边,而且我们也没有插入当前的最后一条边。

To be clear: We had to insert aband bin the current step, but because abwas already found, we updated the active point and did not even attempt to insert b. Why? Because if abis in the tree, every suffixof it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

需要明确的是:我们必须在当前步骤中插入abb,但因为ab已经找到,我们更新了活动点,甚至没有尝试插入b。为什么?因为如果ab在树中, 它的每个后缀(包括b)也必须在树中。也许只是隐含的,但它必须存在,因为到目前为止我们构建树的方式。

We proceed to step 6by incrementing #. The tree is automatically updated to:

我们通过递增进行第 6 步#。树会自动更新为:

Because remainderis 3, we have to insert abx, bxand x. The active point tells us where abends, so we only need to jump there and insert the x. Indeed, xis not there yet, so we split the abcabxedge and insert an internal node:

因为remainder是 3,我们必须插入abx,bxx。活动点告诉我们在哪里ab结束,所以我们只需要跳到那里并插入x. 确实,x还没有,所以我们拆分abcabx边并插入一个内部节点:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

边表示仍然是指向文本的指针,因此可以在 O(1) 时间内完成拆分和插入内部节点。

So we have dealt with abxand decrement remainderto 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1below, and it applies whenever the active_nodeis root (we will learn rule 3 for other cases further below). Here is rule 1:

所以我们已经处理abx并递减remainder到2。现在我们需要插入下一个剩余的后缀,bx。但在我们这样做之前,我们需要更新活动点。在分割和插入一条边之后,此规则将在下面称为规则 1,它适用于任何时候 active_node是根(我们将在下面进一步学习其他情况下的规则 3)。这是规则1:

After an insertion from root,

  • active_noderemains root
  • active_edgeis set to the first character of the new suffix we need to insert, i.e. b
  • active_lengthis reduced by 1

从根插入后,

  • active_node仍然是根
  • active_edge设置为我们需要插入的新后缀的第一个字符,即 b
  • active_length减少 1

Hence, the new active-point triple (root,'b',1)indicates that the next insert has to be made at the bcabxedge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether xis already present or not. If it was present, we would end the current step and leave everything the way it is. But xis not present, so we insert it by splitting the edge:

因此,新的活动点三元组(root,'b',1)指示必须在bcabx边缘进行下一次插入,在 1 个字符之后,即在 之后b。我们可以在 O(1) 时间内识别插入点并检查是否x已经存在。如果它存在,我们将结束当前步骤并保持原样。但是x不存在,所以我们通过分割边缘来插入它:

Again, this took O(1) time and we update remainderto 1 and the active point to (root,'x',0)as rule 1 states.

同样,这花费了 O(1) 时间,我们更新remainder为 1,并将活动点更新(root,'x',0)为规则 1 状态。

But there is one more thing we need to do. We'll call this Rule 2:

但是我们还需要做一件事。我们称之为规则 2:

If we split an edge and insert a new node, and if that is not the first nodecreated during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

如果我们拆分一条边并插入一个新节点,并且如果它不是当前步骤中创建的第一个节点,则我们通过一个特殊的指针,即后缀链接将先前插入的节点和新节点连接起来。我们稍后会看到为什么这很有用。这是我们得到的,后缀链接用虚线表示:

We still need to insert the final suffix of the current step, x. Since the active_lengthcomponent of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

我们仍然需要插入当前步骤的最终后缀, x。由于active_length活动节点的分量已经降到了0,直接在根处进行最后的插入。由于在以 开头的根节点处没有出边x,我们插入一条新边:

As we can see, in the current step all remaining inserts were made.

正如我们所见,在当前步骤中,所有剩余的插入物都已制作完成。

We proceed to step 7by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

我们通过设置=7继续第 7 步#,它会a像往常一样自动将下一个字符 , 附加 到所有叶边缘。然后我们尝试将新的最终字符插入活动点(根),发现它已经在那里了。所以我们在不插入任何东西的情况下结束当前步骤并将活动点更新为(root,'a',1)

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2)and increment remainderwithout doing anything else, because bis already present. However,we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1to refer to the internal node the abedge ends at.

第 8 步#=8,我们追加b,并且如前所述,这仅意味着我们将活动点更新为(root,'a',2)并递增remainder而不做任何其他事情,因为b已经存在。但是,我们注意到(在 O(1) 时间内)活动点现在位于边的末端。我们通过将其重新设置为 来反映这一点 (node1,'\0x',0)。在这里,我node1用来指代ab边缘结束的内部节点。

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

然后,在步骤#=9 中,我们需要插入 'c',这将有助于我们理解最后的技巧:

Second extension: Using suffix links

第二个扩展:使用后缀链接

As always, the #update appends cautomatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainderand do nothing else.

与往常一样,#更新会c自动附加到叶子边缘,然后我们转到活动点以查看是否可以插入“c”。事实证明 'c' 已经存在于那个边缘,所以我们将活动点设置为 (node1,'c',1),递增remainder并且不做任何其他事情。

Now in step #=10, remainderis 4, and so we first need to insert abcd(which remains from 3 steps ago) by inserting dat the active point.

现在在step #=10 中remainder是 4,因此我们首先需要abcd通过d在活动点插入来插入 (从 3 个步骤之前保留)。

Attempting to insert dat the active point causes an edge split in O(1) time:

尝试d在活动点插入会导致 O(1) 时间内的边缘分裂:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

active_node,从该分割被发起的,标记为红色的上方。这是最终规则,规则 3:

After splitting an edge from an active_nodethat is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_nodeto the node it points to. If there is no suffix link, we set the active_nodeto the root. active_edgeand active_lengthremain unchanged.

在从active_node不是根节点的一个边分裂出一条边之后,我们沿着离开该节点的后缀链接(如果有的话)并将 重置为active_node它指向的节点。如果没有后缀链接,我们设置active_node为根。active_edgeactive_length保持不变。

So the active point is now (node2,'c',1), and node2is marked in red below:

所以活动点现在是(node2,'c',1),并node2在下面标记为红色:

Since the insertion of abcdis complete, we decrement remainderto 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcdcan be done by simply inserting its final character dat the active point.

由于 的插入abcd完成,我们递减remainder到 3 并考虑当前步骤的下一个剩余后缀 bcd。规则 3 将活动点设置为正确的节点和边,因此bcd只需d在活动点插入其最后一个字符即可完成插入 。

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

这样做会导致另一个边分裂,并且由于规则 2,我们必须创建一个从先前插入的节点到新节点的后缀链接:

We observe:Suffix links enable us to reset the active point so we can make the next remaining insertat O(1) effort. Look at the graph above to confirm that indeed node at label abis linked to the node at b(its suffix), and the node at abcis linked to bc.

我们观察到:后缀链接使我们能够重置活动点,以便我们可以以 O(1) 的努力进行下一个剩余的插入。查看上图以确认标签ab处的节点确实链接到节点处b(其后缀),并且节点处abc链接到 bc

The current step is not finished yet. remainderis now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node(red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

当前步骤尚未完成。remainder现在是 2,我们需要按照规则 3 再次重置活动点。由于当前active_node(上面红色)没有后缀链接,我们重置为root。活动点现在是(root,'c',1)

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

因此,下一次插入发生在标签以c:开头的根节点的一个传出边上cabxabcd,在第一个字符之后,即在 之后c。这导致了另一个分裂:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

由于这涉及到创建一个新的内部节点,我们遵循规则 2 并从之前创建的内部节点设置一个新的后缀链接:

(I am using Graphviz Dotfor these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

(我正在为这些小图使用Graphviz Dot。新的后缀链接导致 dot 重新排列现有的边,因此请仔细检查以确认上面插入的唯一内容是新的后缀链接。)

With this, remaindercan be set to 1 and since the active_nodeis root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single dat root:

有了这个,remainder可以设置为 1 并且由于active_node是根,我们使用规则 1 将活动点更新为(root,'d',0)。这意味着当前步骤的最终插入是d在根处插入一个:

That was the final step and we are done. There are number of final observations, though:

这是最后一步,我们完成了。不过,有一些最终观察结果

  • In each step we move #forward by 1 position. This automatically updates all leaf nodes in O(1) time.

  • But it does not deal with a) any suffixes remainingfrom previous steps, and b) with the one final character of the current step.

  • remaindertells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important:Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly(otherwise the active point would not be where it is).

  • After each such insert, we decrement remainderand follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

  • If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicitin the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

  • What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $is used as a symbol for that. Why does that matter?--> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are manystrings implicitlycontained in the tree that are not actual suffixes of the main string. Forcing remainderto be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However,if we want to use the tree to search for general substrings, not only suffixesof the main string, this final step is indeed not required, as suggested by the OP's comment below.

  • So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainderinserts, each taking O(1) time. Since remainderindicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

  • However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_lengthcomponent does not work well with the new active_node. For example, consider a situation like this:

  • 在每一步中,我们向前移动#1 个位置。这会在 O(1) 时间内自动更新所有叶节点。

  • 但它不处理 a)前一步骤剩余的任何后缀,以及 b) 当前步骤的最后一个字符。

  • remainder告诉我们需要制作多少个额外的插件。这些插入与在当前位置结束的字符串的最终后缀一一对应#。我们一个接一个地考虑并插入。重要提示:每次插入都在 O(1) 时间内完成,因为活动点告诉我们确切的去向,我们只需要在活动点添加一个字符。为什么?因为其他字符是隐式包含的(否则活动点将不在它所在的位置)。

  • 在每次这样的插入之后,我们递减remainder并遵循后缀链接(如果有的话)。如果不是,我们去root(规则3)。如果我们已经在 root,我们使用规则 1 修改活动点。无论如何,它只需要 O(1) 时间。

  • 如果在其中一个插入过程中,我们发现要插入的字符已经存在,我们不做任何事情并结束当前步骤,即使remainder>0。原因是任何剩余的插入都将是我们刚刚尝试制作的插入的后缀。因此它们都隐含在当前树中。remainder>0的事实确保我们稍后处理剩余的后缀。

  • 如果在算法结束时remainder>0 怎么办?每当文本的结尾是之前某处出现的子字符串时,就会出现这种情况。在这种情况下,我们必须在字符串的末尾附加一个以前没有出现过的额外字符。在文献中,通常使用美元符号$作为符号。为什么这很重要?--> 如果稍后我们使用完整的后缀树来搜索后缀,我们必须仅接受以叶子结尾的匹配项。否则,我们就得到了很多虚假的比赛,因为有很多的字符串隐含包含在树不属于主串的实际后缀。强迫remainder最后为 0 本质上是一种确保所有后缀都以叶节点结尾的方法。但是,如果我们想使用树来搜索一般子字符串,而不仅仅是主字符串的后缀,那么这最后一步确实不是必需的,正如下面 OP 的评论所建议的那样。

  • 那么整个算法的复杂度是多少呢?如果文本长度为 n 个字符,则显然有 n 个步骤(如果添加美元符号,则为 n+1)。在每一步中,我们要么什么都不做(除了更新变量),要么进行remainder插入,每次都花费 O(1) 时间。由于remainder表示我们在前面的步骤中有多少次什么都没做,并且随着我们现在每次插入而递减,我们做某事的总次数正好是 n(或 n+1)。因此,总复杂度为 O(n)。

  • 但是,有一点我没有正确解释:可能会发生我们按照后缀链接,更新活动点,然后发现它的active_length组件不能很好地与新的active_node. 例如,考虑这样的情况:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

(虚线表示树的其余部分。虚线是后缀链接。)

Now let the active point be (red,'d',3), so it points to the place behind the fon the defgedge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

现在,让我们主动点(red,'d',3),使其指向身后的位置f上的defg优势。现在假设我们进行了必要的更新,现在按照后缀链接根据规则 3 更新活动点。新的活动点是(green,'d',3)。但是,d从绿色节点出来的-edge 是de,所以它只有 2 个字符。为了找到正确的活动点,我们显然需要沿着这条边到达蓝色节点并重置为(blue,'f',1)

In a particularly bad case, the active_lengthcould be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n2) complexity, because in each step remainderis generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

在特别糟糕的情况下,active_length可能大到 remainder,也可能大到 n。很可能要找到正确的活动点,我们不仅需要跳过一个内部节点,而且可能需要跳过许多,在最坏的情况下最多可达 n。这是否意味着该算法具有隐藏的 O(n 2) 复杂度,因为在每个步骤remainder中通常为 O(n),并且在遵循后缀链接后对活动节点的后调整也可能为 O(n)?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_lengthwill be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_lengthcan only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_lengthat any given time. Since active_lengthcan never be larger than remainder, and remainderis O(n) not only in every single step, but the total sum of increments ever made to remainderover the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

不。原因是,如果我们确实必须调整活动点(例如,从绿色到蓝色,如上所述),这会将我们带到一个具有自己的后缀链接的新节点,并且active_length会减少。当我们沿着后缀链接链向下时,我们会进行剩余的插入,active_length只能减少,并且我们在途中可以进行的活动点调整的数量不能超过active_length任何给定时间。由于 active_length永远不会大于remainder,并且remainder不仅在每个步骤中都是 O(n),而且remainder在整个过程中所做的增量总和也是 O(n),因此活动点调整的数量是也以 O(n) 为界。

回答by makagonov

I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to wording used for the rules. Moreover, I've mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create importantsuffix links.

我尝试使用 jogojapan 的答案中给出的方法来实现后缀树,但由于用于规则的措辞,它在某些情况下不起作用。此外,我已经提到没有人能够使用这种方法实现绝对正确的后缀树。下面我将写一篇关于 jogojapan 答案的“概述”,并对规则进行一些修改。我还将描述我们忘记创建重要后缀链接的情况。

Additional variables used

使用的附加变量

  1. active point- a triple (active_node; active_edge; active_length), showing from where we must start inserting a new suffix.
  2. remainder- shows the number of suffixes we must add explicitly. For instance, if our word is 'abcaabca', and remainder = 3, it means we must process 3 last suffixes: bca, caand a.
  1. 活动点- 三元组 (active_node; active_edge; active_length),显示我们必须从哪里开始插入新后缀。
  2. 余数- 显示我们必须明确添加的后缀数。例如,如果我们的单词是 'abcaabca',并且余数 = 3,这意味着我们必须处理 3 个最后的后缀:bcacaa

Let's use a concept of an internal node- all the nodes, except the rootand the leafsare internal nodes.

让我们使用内部节点的概念——除根之外的所有节点都是内部节点

Observation 1

观察 1

When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active pointand remainder).

当发现我们需要插入的最终后缀已经存在于树中时,树本身根本没有改变(我们只更新active pointremainder)。

Observation 2

观察2

If at some point active_lengthis greater or equal to the length of current edge (edge_length), we move our active pointdown until edge_lengthis strictly greater than active_length.

如果在某个点active_length大于或等于当前边的长度 ( edge_length),我们active point向下移动直到edge_length严格大于active_length

Now, let's redefine the rules:

现在,让我们重新定义规则:

Rule 1

规则1

If after an insertion from the active node= root, the active lengthis greater than 0, then:

  1. active nodeis not changed
  2. active lengthis decremented
  3. active edgeis shifted right (to the first character of the next suffix we must insert)

如果从活动节点插入后= root活动长度大于 0,则:

  1. 活动节点没有改变
  2. 活动长度递减
  3. 活动边缘向右移动(到我们必须插入的下一个后缀的第一个字符)

Rule 2

规则 2

If we create a new internal nodeORmake an inserter from an internal node, and this is not the first SUCHinternal nodeat current step, then we link the previous SUCHnode with THISone through a suffix link.

如果我们创建一个新的内部节点从一个内部节点制作一个插入器,并且这不是当前步骤的第一个SUCH内部节点,那么我们通过后缀链接将前一个SUCH节点与这个节点链接起来

This definition of the Rule 2is different from jogojapan', as here we take into account not only the newly createdinternal nodes, but also the internal nodes, from which we make an insertion.

的这个定义Rule 2与 jogojapan' 不同,因为在这里我们不仅考虑新创建的内部节点,还考虑我们进行插入的内部节点。

Rule 3

规则 3

After an insert from the active nodewhich is not the rootnode, we must follow the suffix link and set the active nodeto the node it points to. If there is no a suffix link, set the active nodeto the rootnode. Either way, active edgeand active lengthstay unchanged.

从插入后活动节点是不是节点,我们必须按照后缀链路和设置主动节点所指向的节点。如果不存在一个链路后缀,设置活动节点节点。无论哪种方式,活动边活动长度保持不变。

In this definition of Rule 3we also consider the inserts of leaf nodes (not only split-nodes).

在这个定义中,Rule 3我们还考虑了叶节点的插入(不仅是分裂节点)。

And finally, Observation 3:

最后,观察 3:

When the symbol we want to add to the tree is already on the edge, we, according to Observation 1, update only active pointand remainder, leaving the tree unchanged. BUTif there is an internal nodemarked as needing suffix link, we must connect that node with our current active nodethrough a suffix link.

当我们想要添加到树中的符号已经在边上时,我们根据Observation 1,只更新active pointremainder,保持树不变。但是如果有一个内部节点标记为需要后缀链接,我们必须active node通过后缀链接将该节点与我们的当前节点连接起来。

Let's look at the example of a suffix tree for cdddcdcif we add a suffix link in such case and if we don't:

如果我们在这种情况下添加后缀链接,如果我们不添加后缀链接,让我们看一下cdddcdc后缀树的示例:

  1. If we DON'Tconnect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

  2. If we DOconnect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

  1. 如果我们通过后缀链接连接节点:

    • 在添加最后一个字母c 之前

    • 添加最后一个字母c 后

  2. 如果我们确实通过后缀链接连接节点:

    • 在添加最后一个字母c 之前

    • 添加最后一个字母c 后

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very importantfor our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there's no a suffix link, then we must put the active_nodeto the root.

似乎没有显着差异:在第二种情况下,还有两个后缀链接。但是这些后缀链接是正确的,其中之一——从蓝色节点到红色节点——对于我们使用活动点的方法非常重要。问题是,如果我们不在这里放置后缀链接,稍后当我们向树添加一些新字母时,由于 ,我们可能会省略向树添加一些节点,因为根据它,如果没有后缀链接,那么我们必须把它放到根目录下。Rule 3active_node

When we were adding the last letter to the tree, the red node had already existedbefore we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active pointapproach, the active nodewas set to the red node. But we don't make an insert from the red node, as the letter 'c'is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active pointapproach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shortersuffix.

当我们将最后一个字母添加到树中时,在我们从蓝色节点(标记为'c'的边)插入之前,红色节点已经存在。由于有来自蓝色节点的插入,我们将其标记为需要后缀链接。然后,依靠活动点方法,将设置为红色节点。但是我们不会从红色节点插入,因为字母“c”已经在边缘。是不是意味着蓝色节点必须没有后缀链接?不,我们必须通过后缀链接将蓝色节点与红色节点连接起来。为什么是正确的?因为活动点active node方法保证我们到达正确的位置,即到达我们必须处理较短后缀的插入的下一个位置。

Finally, here are my implementations of the Suffix Tree:

最后,这是我对后缀树的实现:

  1. Java
  2. C++
  1. 爪哇
  2. C++

Hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement his own Suffix Tree.

希望这个“概述”与 jogojapan 的详细答案相结合将有助于某人实现他自己的后缀树。

回答by mutux

Thanks for the well explained tutorial by @jogojapan, I implemented the algorithm in Python.

感谢@jogojapan解释得很好的教程,我在 Python 中实现了该算法。

A couple of minor problems mentioned by @jogojapan turns out to be more sophisticatedthan I have expected, and need to be treated very carefully. It cost me several days to get my implementation robust enough(I suppose). Problems and solutions are listed below:

@jogojapan 提到的几个小问题比我预期的要复杂,需要非常小心地对待。我花了几天时间才使我的实现足够健壮(我想)。问题及解决方法如下:

  1. End with Remainder > 0It turns out this situation can also happen during the unfolding step, not just the end of the entire algorithm. When that happens, we can leave the remainder, actnode, actedge, and actlength unchanged, end the current unfolding step, and start another step by either keep folding or unfolding depending on if the next char in the original string is on the current path or not.

  2. Leap Over Nodes:When we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. We have to move forwardto the right place to split, or insert a leaf. This process might be not that straightforwardbecause during the moving the actlength and actedge keep changing all the way, when you have to move back to the root node, the actedgeand actlengthcould be wrongbecause of those moves. We need additional variable(s) to keep that information.

    enter image description here

  1. End withRemainder > 0事实证明,这种情况也可能发生在展开步骤期间,而不仅仅是整个算法的结尾。当发生这种情况时,我们可以保持余数、actnode、actedge 和 actlength不变,结束当前的展开步骤,并根据原始字符串中的下一个字符是否在当前路径上或继续折叠或展开来开始另一个步骤。不是。

  2. Leap Over Nodes:当我们沿着一个后缀链接,更新活动点,然后发现它的active_length分量与新的active_node不能很好的配合。我们必须向前移动到正确的地方进行分裂,或者插入一片叶子​​。这个过程可能不是那么简单,因为在移动过程中 actlength 和 actedge 一直在变化,当你必须移回根节点时,由于这些移动,actedgeactlength可能是错误的。我们需要额外的变量来保存这些信息。

    在此处输入图片说明

The other two problems have somehow been pointed out by @managonov

@managonov以某种方式指出了另外两个问题

  1. Split Could DegenerateWhen trying to split an edge, sometime you'll find the split operation is right on a node. That case we only need add a new leaf to that node, take it as a standard edge split operation, which means the suffix links if there's any, should be maintained correspondingly.

  2. Hidden Suffix LinksThere is another special case which is incurred by problem 1and problem 2. Sometimes we need to hop over several nodes to the right point for split, we might surpassthe right point if we move by comparing the remainder string and the path labels. That case the suffix link will be neglected unintentionally, if there should be any. This could be avoided by remembering the right pointwhen moving forward. The suffix link should be maintained if the split node already exists, or even the problem 1happens during a unfolding step.

  1. 拆分可能退化当尝试拆分一条边时,有时您会发现拆分操作正好在节点上。这种情况我们只需要给那个节点添加一个新的叶子,把它作为一个标准的边分裂操作,这意味着如果有后缀链接,应该相应地维护。

  2. 隐藏的后缀链接问题 1问题 2导致了另一种特殊情况。有时我们需要跳过几个节点到正确的点进行分割,如果我们通过比较剩余的字符串和路径标签来移动,我们可能会超过正确的点。在这种情况下,后缀链接将被无意中忽略,如果应该有的话。这可以通过在前进时记住正确的点来避免。如果分裂节点已经存在,则应该保持后缀链接,或者甚至在展开步骤中出现问题1

Finally, my implementation in Pythonis as follows:

最后,我在Python中的实现如下:

Tips:It includes a naive tree printingfunction in the code above, which is very important while debugging. It saved me a lot of time and is convenient for locating special cases.

提示:在上面的代码中包含了一个朴素的树打印功能,这在调试时非常重要。它为我节省了大量时间,并且方便定位特殊情况。

回答by MrValueType

Apologies if my answer seems redundant, but I implemented Ukkonen's algorithm recently, and found myself struggling with it for days; I had to read through multiple papers on the subject to understand the why and how of some core aspects of the algorithm.

如果我的回答看起来多余,我深表歉意,但我最近实施了 Ukkonen 的算法,发现自己为此苦苦挣扎了好几天;我不得不通读关于这个主题的多篇论文,以了解算法某些核心方面的原因和方式。

I found the 'rules' approach of previous answers unhelpful for understanding the underlying reasons, so I've written everything below focusing solely on the pragmatics. If you've struggled with following other explanations, just like I did, perhaps my supplemental explanation will make it 'click' for you.

我发现以前答案的“规则”方法对理解根本原因无济于事,所以我在下面写了所有内容,只关注语用学。如果您一直在努力遵循其他解释,就像我一样,也许我的补充解释会让您“点击”。

I published my C# implementation here: https://github.com/baratgabor/SuffixTree

我在这里发布了我的 C# 实现:https: //github.com/baratgabor/SuffixTree

Please note that I'm not an expert on this subject, so the following sections may contain inaccuracies (or worse). If you encounter any, feel free to edit.

请注意,我不是这方面的专家,因此以下部分可能包含不准确的地方(或更糟)。如果您遇到任何问题,请随时编辑。

Prerequisites

先决条件

The starting point of the following explanation assumes you're familiar with the content and use of suffix trees, and the characteristics of Ukkonen's algorithm, e.g. how you're extending the suffix tree character by character, from start to end. Basically, I assume you've read some of the other explanations already.

以下解释的起点假设您熟悉后缀树的内容和使用,以及 Ukkonen 算法的特征,例如您如何从头到尾逐个字符地扩展后缀树。基本上,我假设您已经阅读了其他一些解释。

(However, I did have to add some basic narrative for the flow, so the beginning might indeed feel redundant.)

(但是,我确实必须为流程添加一些基本的叙述,因此开头可能确实会让人觉得多余。)

The most interesting part is the explanation on the difference between using suffix links and rescanning from the root. This is what gave me a lot of bugs and headaches in my implementation.

最有趣的部分是解释使用后缀链接和从根重新扫描之间的区别。这就是在我的实现中给我带来很多错误和头痛的原因。

Open-ended leaf nodes and their limitations

开放式叶节点及其局限性

I'm sure you already know that the most fundamental 'trick' is to realize we can just leave the end of the suffixes 'open', i.e. referencing the current length of the string instead of setting the end to a static value. This way when we add additional characters, those characters will be implicitly added to all suffix labels, without having to visit and update all of them.

我相信您已经知道最基本的“技巧”是意识到我们可以将后缀的结尾保留为“开放”,即引用字符串的当前长度而不是将结尾设置为静态值。这样,当我们添加其他字符时,这些字符将被隐式添加到所有后缀标签中,而无需访问和更新所有后缀标签。

But this open ending of suffixes – for obvious reasons – works only for nodes that represent the end of the string, i.e. the leaf nodes in the tree structure. The branching operations we execute on the tree (the addition of new branch nodes and leaf nodes) won't propagate automatically everywhere they need to.

但是这种后缀的开放式结尾——出于显而易见的原因——仅适用于代表字符串结尾的节点,即树结构中的叶节点。我们在树上执行的分支操作(添加新的分支节点和叶节点)不会自动传播到他们需要的任何地方。

It's probably elementary, and wouldn't require mention, that repeated substrings don't appear explicitly in the tree, since the tree already contains these by virtue of them being repetitions; however, when the repetitive substring ends by encountering a non-repeating character, we need to create a branching at that point to represent the divergence from that point onwards.

这可能是基本的,不需要提及,重复的子串没有明确地出现在树中,因为树已经包含这些,因为它们是重复的;但是,当重复子串以遇到非重复字符而结束时,我们需要在该点创建一个分支以表示从该点开始的分歧。

For example in case of the string 'ABCXABCY'(see below), a branching to Xand Yneeds to be added to three different suffixes, ABC, BCand C; otherwise it wouldn't be a valid suffix tree, and we couldn't find all substrings of the string by matching characters from the root downwards.

例如,对于字符串'ABCXABCY'(见下文),需要将XY的分支添加到三个不同的后缀ABCBCC;否则它就不是一个有效的后缀树,我们无法通过从根向下匹配字符来找到字符串的所有子字符串。

Once again, to emphasize – anyoperation we execute on a suffix in the tree needs to be reflected by its consecutive suffixes as well (e.g. ABC > BC > C), otherwise they simply cease to be valid suffixes.

再次强调 -我们对树中的后缀执行的任何操作也需要通过其连续后缀来反映(例如 ABC > BC > C),否则它们就不再是有效的后缀。

Repeating branching in suffixes

在后缀中重复分支

But even if we accept that we have to do these manual updates, how do we know how many suffixes need to be updated? Since, when we add the repeated character A(and the rest of the repeated characters in succession), we have no idea yet when/where do we need to split the suffix into two branches. The need to split is ascertained only when we encounter the first non-repeating character, in this case Y(instead of the Xthat already exists in the tree).

但即使我们接受必须进行这些手动更新,我们如何知道需要更新多少个后缀?因为,当我们添加重复的字符A(以及连续的其余重复字符)时,我们还不知道何时/何地需要将后缀分成两个分支。只有当我们遇到第一个非重复字符时才确定是否需要拆分,在这种情况下是Y(而不是树中已经存在的X)。

What we can do is to match the longest repeated string we can, and count how many of its suffixes we need to update later. This is what 'remainder'stands for.

我们能做的就是匹配我们能匹配到的最长的重复字符串,并计算我们后面需要更新多少个后缀。这就是“剩余”的含义。

The concept of 'remainder' and 'rescanning'

“剩余”和“重新扫描”的概念

The variable remaindertells us how many repeated characters we added implicitly, without branching; i.e. how many suffixes we need to visit to repeat the branching operation once we found the first character that we cannot match. This essentially equals to how many characters 'deep' we are in the tree from its root.

变量remainder告诉我们隐式添加了多少重复字符,没有分支;即一旦我们找到第一个无法匹配的字符,我们需要访问多少个后缀来重复分支操作。这基本上等于我们在树中从根开始“深”多少个字符。

So, staying with the previous example of the string ABCXABCY, we match the repeated ABCpart 'implicitly', incrementing remaindereach time, which results in remainder of 3. Then we encounter the non-repeating character 'Y'. Here we split the previously added ABCXinto ABC->Xand ABC->Y. Then we decrement remainderfrom 3 to 2, because we already took care of the ABCbranching. Now we repeat the operation by matching the last 2 characters – BC– from the root to reach the point where we need to split, and we split BCXtoo into BC->Xand BC->Y. Again, we decrement remainderto 1, and repeat the operation; until the remainderis 0. Lastly, we need to add the current character (Y) itself to the root as well.

因此,继续前面的字符串ABCXABCY 示例,我们“隐式”匹配重复的ABC部分,remainder每次递增,结果为 3 的余数。然后我们遇到非重复字符'Y'。这里我们将之前添加的ABCX 拆分ABC-> XABC-> Y。然后我们remainder从 3减少到 2,因为我们已经处理了ABC分支。现在我们通过匹配从根到我们需要拆分的点的最后 2 个字符BC 来重复操作,我们也将BCX拆分为BC-> XBC-> Y。再次,我们递减remainder到 1,并重复操作;直到remainder为 0。最后,我们还需要将当前字符 ( Y) 本身添加到根中。

This operation, following the consecutive suffixes from the root simply to reach the point where we need to do an operation is what's called 'rescanning'in Ukkonen's algorithm, and typically this is the most expensive part of the algorithm. Imagine a longer string where you need to 'rescan' long substrings, across many dozens of nodes (we'll discuss this later), potentially thousands of times.

在 Ukkonen 算法中,此操作遵循从根开始的连续后缀,只是为了到达我们需要执行操作的点,这就是所谓的“重新扫描”,通常这是算法中最昂贵的部分。想象一个更长的字符串,您需要在其中“重新扫描”长子字符串,跨越许多节点(我们将在稍后讨论),可能是数千次。

As a solution, we introduce what we call 'suffix links'.

作为解决方案,我们引入了我们所说的“后缀链接”

The concept of 'suffix links'

“后缀链接”的概念

Suffix links basically point to the positions we'd normally have to 'rescan'to, so instead of the expensive rescan operation we can simply jump to the linked position, do our work, jump to the next linked position, and repeat – until there are no more positions to update.

后缀链接基本上指向我们通常必须“重新扫描”到的位置,因此我们可以简单地跳转到链接位置,完成我们的工作,跳转到下一个链接位置,然后重复——直到那里没有更多的职位要更新。

Of course one big question is how to add these links. The existing answer is that we can add the links when we insert new branch nodes, utilizing the fact that, in each extension of the tree, the branch nodes are naturally created one after another in the exact order we'd need to link them together. Though, we have to link from the last created branch node (the longest suffix) to the previously created one, so we need to cache the last we create, link that to the next one we create, and cache the newly created one.

当然,一个大问题是如何添加这些链接。现有的答案是,我们可以在插入新的分支节点时添加链接,利用这样一个事实,即在树的每个扩展中,分支节点自然会按照我们需要将它们链接在一起的确切顺序一个接一个地创建. 但是,我们必须从最后创建的分支节点(最长的后缀)链接到先前创建的分支节点,因此我们需要缓存我们创建的最后一个,将其链接到我们创建的下一个,并缓存新创建的一个。

One consequence is that we actually often don't have suffix links to follow, because the given branch node was just created. In these cases we have to still fall back to the aforementioned 'rescanning'from root. This is why, after an insertion, you're instructed to either use the suffix link, or jump to root.

一个后果是我们实际上经常没有后缀链接可以跟随,因为给定的分支节点刚刚创建。在这些情况下,我们仍然必须回到前面提到从根开始的“重新扫描”。这就是为什么在插入后,您会被指示使用后缀链接或跳转到根目录。

(Or alternatively, if you're storing parent pointers in the nodes, you can try to follow the parents, check if they have a link, and use that. I found that this is very rarely mentioned, but the suffix link usage is not set in stones.There are multiple possible approaches, and if you understand the underlying mechanism you can implement one that fits your needs the best.)

(或者,如果你在节点中存储父指针,你可以尝试跟随父节点,检查它们是否有链接,然后使用它。我发现很少提到这个,但后缀链接的用法不是一成不变。有多种可能的方法,如果您了解底层机制,您可以实施最适合您需求的方法。)

The concept of 'active point'

“活动点”的概念

So far we discussed multiple efficient tools for building the tree, and vaguely referred to traversing over multiple edges and nodes, but haven't yet explored the corresponding consequences and complexities.

到目前为止,我们讨论了用于构建树的多种有效工具,并模糊地提到了遍历多个边和节点,但还没有探讨相应的后果和复杂性。

The previously explained concept of 'remainder'is useful for keeping track where we are in the tree, but we have to realize it doesn't store enough information.

前面解释的“剩余”概念对于跟踪我们在树中的位置很有用,但我们必须意识到它没有存储足够的信息。

Firstly, we always reside on a specific edge of a node, so we need to store the edge information. We shall call this 'active edge'.

首先,我们总是驻留在节点的特定边上,所以我们需要存储边信息。我们将称之为“活动边缘”

Secondly, even after adding the edge information, we still have no way to identify a position that is farther down in the tree, and not directly connected to the rootnode. So we need to store the node as well. Let's call this 'active node'.

其次,即使添加了边缘信息后,我们仍然没有办法确定一个位置,在树越往下,而不是直接连接到节点。所以我们也需要存储节点。我们称之为“活动节点”

Lastly, we can notice that the 'remainder'is inadequate to identify a position on an edge that is not directly connected to root, because 'remainder'is the length of the entire route; and we probably don't want to bother with remembering and subtracting the length of the previous edges. So we need a representation that is essentially the remainder on the current edge. This is what we call 'active length'.

最后,我们可以注意到“剩余部分”不足以识别不直接连接到根的边上的位置,因为“剩余部分”是整个路线的长度;我们可能不想费心记住和减去之前边的长度。所以我们需要一个本质上是当前边上余数的表示。这就是我们所说的“活动长度”

This leads to what we call 'active point'– a package of three variables that contain all the information we need to maintain about our position in the tree:

这导致了我们所说的“活动点”——一个包含三个变量的包,其中包含我们需要维护的关于我们在树中的位置的所有信息:

Active Point = (Active Node, Active Edge, Active Length)

Active Point = (Active Node, Active Edge, Active Length)

You can observe on the following image how the matched route of ABCABDconsists of 2 characters on the edge AB(from root), plus 4 characters on the edge CABDABCABD(from node 4) – resulting in a 'remainder'of 6 characters. So, our current position can be identified as Active Node 4, Active Edge C, Active Length 4.

您可以在下图中观察到ABCABD的匹配路由如何由边AB(来自)上的 2 个字符,加上边CABDABCABD(来自节点 4)上的 4 个字符组成- 导致6 个字符的“余数”。因此,我们当前的位置可以识别为Active Node 4, Active Edge C, Active Length 4

Remainder and Active Point

剩余和活动点

Another important role of the 'active point'is that it provides an abstraction layer for our algorithm, meaning that parts of our algorithm can do their work on the 'active point', irrespective of whether that active point is in the root or anywhere else. This makes it easy to implement the use of suffix links in our algorithm in a clean and straight-forward way.

“活动点”的另一个重要作用是它为我们的算法提供了一个抽象层,这意味着我们的算法的各个部分都可以在“活动点”上进行工作,而不管该活动点是在根中还是在其他任何地方. 这使得在我们的算法中以干净和直接的方式实现后缀链接的使用变得容易。

Differences of rescanning vs using suffix links

重新扫描与使用后缀链接的差异

Now, the tricky part, something that – in my experience – can cause plenty of bugs and headaches, and is poorly explained in most sources, is the difference in processing the suffix link cases vs the rescan cases.

现在,棘手的部分是——根据我的经验——会导致大量错误和头痛,并且在大多数来源中解释得很差,是处理后缀链接案例与重新扫描案例的差异。

Consider the following example of the string 'AAAABAAAABAAC':

考虑以下字符串'AAAABAAAABAAC' 的示例:

Remainder across multiple edges

跨多个边的余数

You can observe above how the 'remainder'of 7 corresponds to the total sum of characters from root, while 'active length'of 4 corresponds to the sum of matched characters from the active edge of the active node.

您可以在上面观察到7的“余数”如何对应于来自根的字符总和,而 4 的“活动长度”对应于来自活动节点的活动边缘的匹配字符的总和。

Now, after executing a branching operation at the active point, our active node might or might not contain a suffix link.

现在,在活动点执行分支操作后,我们的活动节点可能包含也可能不包含后缀链接。

If a suffix link is present:We only need to process the 'active length'portion. The 'remainder'is irrelevant, because the node where we jump to via the suffix link already encodes the correct 'remainder' implicitly, simply by virtue of being in the tree where it is.

如果存在后缀链接:我们只需要处理“活动长度”部分。该“其余部分,是无关紧要的,因为在这里我们通过后缀链接跳转到该节点已编码正确的“余”含蓄,只是凭借在树枝上,它是。

If a suffix link is NOT present:We need to 'rescan'from zero/root, which means processing the whole suffix from the beginning. To this end we have to use the whole 'remainder'as the basis of rescanning.

如果后缀链接不存在:我们需要从零/根开始“重新扫描”,这意味着从头开始处理整个后缀。为此,我们必须使用整个“剩余部分”作为重新扫描的基础。

Example comparison of processing with and without a suffix link

带后缀链接和不带后缀链接的处理示例比较

Consider what happens at the next step of the example above. Let's compare how to achieve the same result – i.e. moving to the next suffix to process – with and without a suffix link.

考虑在上述示例的下一步中会发生什么。让我们比较一下如何实现相同的结果——即移动到下一个要处理的后缀——使用和不使用后缀链接。

Using 'suffix link'

使用“后缀链接”

Reaching consecutive suffixes via suffix links

通过后缀链接到达连续的后缀

Notice that if we use a suffix link, we are automatically 'at the right place'. Which is often not strictly true due to the fact that the 'active length'can be 'incompatible' with the new position.

请注意,如果我们使用后缀链接,我们会自动“在正确的位置”。由于“活动长度”可能与新位置“不兼容”这一事实,这通常不是严格正确的。

In the case above, since the 'active length'is 4, we're working with the suffix 'ABAA', starting at the linked Node 4. But after finding the edge that corresponds to the first character of the suffix ('A'), we notice that our 'active length'overflows this edge by 3 characters. So we jump over the full edge, to the next node, and decrement 'active length'by the characters we consumed with the jump.

在上面的例子中,由于“活动长度”是 4,我们使用后缀“ ABAA”,从链接的节点 4 开始。但是在找到与后缀的第一个字符(“A”)相对应的边之后),我们注意到我们的“活动长度”在这条边上溢出了 3 个字符。所以我们跳过整个边,到下一个节点,并通过我们在跳跃中消耗的字符减少“活动长度”

Then, after we found the next edge 'B', corresponding to the decremented suffix 'BAA', we finally note that the edge length is larger than the remaining 'active length'of 3, which means we found the right place.

然后,在我们找到下一条边'B' 后,对应于递减的后缀'BAA',我们终于注意到边长大于剩余的'active length'3,这意味着我们找到了正确的位置。

Please note that it seems this operation is usually not referred to as 'rescanning', even though to me it seems it's the direct equivalent of rescanning, just with a shortened length and a non-root starting point.

请注意,此操作似乎通常不被称为“重新扫描”,尽管在我看来它似乎是重新扫描的直接等价物,只是长度缩短且起点为非根。

Using 'rescan'

使用“重新扫描”

Reaching consecutive suffixes via rescanning

通过重新扫描达到连续的后缀

Notice that if we use a traditional 'rescan' operation (here pretending we didn't have a suffix link), we start at the top of the tree, at root, and we have to work our way down again to the right place, following along the entire length of the current suffix.

请注意,如果我们使用传统的“重新扫描”操作(这里假设我们没有后缀链接),我们从树的顶部开始,从根开始,我们必须再次向下移动到正确的位置,跟随当前后缀的整个长度。

The length of this suffix is the 'remainder'we discussed before. We have to consume the entirety of this remainder, until it reaches zero. This might (and often does) include jumping through multiple nodes, at each jump decreasing the remainder by the length of the edge we jumped through. Then finally, we reach an edge that is longer than our remaining 'remainder'; here we set the active edge to the given edge, set 'active length'to remaining 'remainder', and we're done.

这个后缀的长度就是我们之前讨论过的“余数”。我们必须消耗掉整个剩余部分,直到它达到零。这可能(并且经常)包括跳过多个节点,在每次跳转时,将剩余部分减少我们跳过的边的长度。最后,我们到达一个比我们剩余的“剩余部分”更长的边;在这里,我们将活动边设置为给定的边,将“活动长度”设置为剩余的“剩余”,我们就完成了。

Note, however, that the actual 'remainder'variable needs to be preserved, and only decremented after each node insertion. So what I described above assumed using a separate variable initialized to 'remainder'.

但是请注意,实际的“剩余”变量需要保留,并且仅在每个节点插入后递减。所以我上面描述的假设使用一个单独的变量初始化为'remainder'

Notes on suffix links & rescans

关于后缀链接和重新扫描的说明

1) Notice that both methods lead to the same result. Suffix link jumping is, however, significantly faster in most cases; that's the whole rationale behind suffix links.

1) 请注意,这两种方法都会导致相同的结果。然而,在大多数情况下,后缀链接跳转要快得多;这就是后缀链接背后的全部原理。

2) The actual algorithmic implementations don't need to differ. As I mentioned above, even in the case of using the suffix link, the 'active length'is often not compatible with the linked position, since that branch of the tree might contain additional branching. So essentially you just have to use 'active length'instead of 'remainder', and execute the same rescanning logic until you find an edge that is shorter than your remaining suffix length.

2) 实际的算法实现不需要不同。正如我上面提到的,即使在使用后缀链接的情况下,“活动长度”通常与链接位置不兼容,因为树的那个分支可能包含额外的分支。所以基本上你只需要使用'active length'而不是'remainder',并执行相同的重新扫描逻辑,直到找到比剩余后缀长度短的边。

3) One important remark pertaining to performance is that there is no need to check each and every character during rescanning. Due to the way a valid suffix tree is built, we can safely assume that the characters match. So you're mostly counting the lengths, and the only need for character equivalence checking arises when we jump to a new edge, since edges are identified by their first character (which is always unique in the context of a given node). This means that 'rescanning' logic is different than full string matching logic (i.e. searching for a substring in the tree).

3) 关于性能的一个重要说明是在重新扫描期间无需检查每个字符。由于构建有效后缀树的方式,我们可以安全地假设字符匹配。所以你主要是计算长度,当我们跳到新边时,唯一需要进行字符等效性检查,因为边是由它们的第一个字符标识的(在给定节点的上下文中它总是唯一的)。这意味着“重新扫描”逻辑不同于完整字符串匹配逻辑(即在树中搜索子字符串)。

4) The original suffix linking described here is just one of the possible approaches. For example NJ Larsson et al.names this approach as Node-Oriented Top-Down, and compares it to Node-Oriented Bottom-Upand two Edge-Orientedvarieties. The different approaches have different typical and worst case performances, requirements, limitations, etc., but it generally seems that Edge-Orientedapproaches are an overall improvement to the original.

4) 这里描述的原始后缀链接只是可能的方法之一。例如NJ Larsson 等人。将此方法命名为Node-Oriented Top-Down,并将其与Node-Oriented Bottom-Up和两个Edge-Oriented变体进行比较。不同的方法具有不同的典型和最坏情况的性能、要求、限制等,但通常看来,面向边缘的方法是对原始方法的全面改进。

回答by Kamil

@jogojapan you brought awesome explanation and visualisation. But as @makagonov mentioned it's missing some rules regarding setting suffix links. It's visible in nice way when going step by step on http://brenden.github.io/ukkonen-animation/through word 'aabaaabb'. When you go from step 10 to step 11, there is no suffix link from node 5 to node 2 but active point suddenly moves there.

@jogojapan 你带来了很棒的解释和可视化。但正如@makagonov 提到的,它缺少一些关于设置后缀链接的规则。在http://brenden.github.io/ukkonen-animation/ 上通过单词“aabaaabb”一步一步地进行时,它会以很好的方式可见。当你从第10步到第11步时,节点5到节点2没有后缀链接,但活动点突然移动到那里。

@makagonov since I live in Java world I also tried to follow your implementation to grasp ST building workflow but it was hard to me because of:

@makagonov 自从我生活在 Java 世界以来,我也尝试按照您的实现来掌握 ST 构建工作流程,但对我来说很难,因为:

  • combining edges with nodes
  • using index pointers instead of references
  • breaks statements;
  • continue statements;
  • 结合边与节点
  • 使用索引指针代替引用
  • 中断语句;
  • 继续声明;

So I ended up with such implementation in Java which I hope reflects all steps in clearer way and will reduce learning time for other Java people:

所以我最终在 Java 中实现了这样的实现,我希望它以更清晰的方式反映所有步骤,并减少其他 Java 人员的学习时间:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ST {

  public class Node {
    private final int id;
    private final Map<Character, Edge> edges;
    private Node slink;

    public Node(final int id) {
        this.id = id;
        this.edges = new HashMap<>();
    }

    public void setSlink(final Node slink) {
        this.slink = slink;
    }

    public Map<Character, Edge> getEdges() {
        return this.edges;
    }

    public Node getSlink() {
        return this.slink;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"id\"")
                .append(":")
                .append(this.id)
                .append(",")
                .append("\"slink\"")
                .append(":")
                .append(this.slink != null ? this.slink.id : null)
                .append(",")
                .append("\"edges\"")
                .append(":")
                .append(edgesToString(word))
                .append("}")
                .toString();
    }

    private StringBuilder edgesToString(final String word) {
        final StringBuilder edgesStringBuilder = new StringBuilder();
        edgesStringBuilder.append("{");
        for(final Map.Entry<Character, Edge> entry : this.edges.entrySet()) {
            edgesStringBuilder.append("\"")
                    .append(entry.getKey())
                    .append("\"")
                    .append(":")
                    .append(entry.getValue().toString(word))
                    .append(",");
        }
        if(!this.edges.isEmpty()) {
            edgesStringBuilder.deleteCharAt(edgesStringBuilder.length() - 1);
        }
        edgesStringBuilder.append("}");
        return edgesStringBuilder;
    }

    public boolean contains(final String word, final String suffix) {
        return !suffix.isEmpty()
                && this.edges.containsKey(suffix.charAt(0))
                && this.edges.get(suffix.charAt(0)).contains(word, suffix);
    }
  }

  public class Edge {
    private final int from;
    private final int to;
    private final Node next;

    public Edge(final int from, final int to, final Node next) {
        this.from = from;
        this.to = to;
        this.next = next;
    }

    public int getFrom() {
        return this.from;
    }

    public int getTo() {
        return this.to;
    }

    public Node getNext() {
        return this.next;
    }

    public int getLength() {
        return this.to - this.from;
    }

    public String toString(final String word) {
        return new StringBuilder()
                .append("{")
                .append("\"content\"")
                .append(":")
                .append("\"")
                .append(word.substring(this.from, this.to))
                .append("\"")
                .append(",")
                .append("\"next\"")
                .append(":")
                .append(this.next != null ? this.next.toString(word) : null)
                .append("}")
                .toString();
    }

    public boolean contains(final String word, final String suffix) {
        if(this.next == null) {
            return word.substring(this.from, this.to).equals(suffix);
        }
        return suffix.startsWith(word.substring(this.from,
                this.to)) && this.next.contains(word, suffix.substring(this.to - this.from));
    }
  }

  public class ActivePoint {
    private final Node activeNode;
    private final Character activeEdgeFirstCharacter;
    private final int activeLength;

    public ActivePoint(final Node activeNode,
                       final Character activeEdgeFirstCharacter,
                       final int activeLength) {
        this.activeNode = activeNode;
        this.activeEdgeFirstCharacter = activeEdgeFirstCharacter;
        this.activeLength = activeLength;
    }

    private Edge getActiveEdge() {
        return this.activeNode.getEdges().get(this.activeEdgeFirstCharacter);
    }

    public boolean pointsToActiveNode() {
        return this.activeLength == 0;
    }

    public boolean activeNodeIs(final Node node) {
        return this.activeNode == node;
    }

    public boolean activeNodeHasEdgeStartingWith(final char character) {
        return this.activeNode.getEdges().containsKey(character);
    }

    public boolean activeNodeHasSlink() {
        return this.activeNode.getSlink() != null;
    }

    public boolean pointsToOnActiveEdge(final String word, final char character) {
        return word.charAt(this.getActiveEdge().getFrom() + this.activeLength) == character;
    }

    public boolean pointsToTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() == this.activeLength;
    }

    public boolean pointsAfterTheEndOfActiveEdge() {
        return this.getActiveEdge().getLength() < this.activeLength;
    }

    public ActivePoint moveToEdgeStartingWithAndByOne(final char character) {
        return new ActivePoint(this.activeNode, character, 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge() {
        return new ActivePoint(this.getActiveEdge().getNext(), null, 0);
    }

    public ActivePoint moveToSlink() {
        return new ActivePoint(this.activeNode.getSlink(),
                this.activeEdgeFirstCharacter,
                this.activeLength);
    }

    public ActivePoint moveTo(final Node node) {
        return new ActivePoint(node, this.activeEdgeFirstCharacter, this.activeLength);
    }

    public ActivePoint moveByOneCharacter() {
        return new ActivePoint(this.activeNode,
                this.activeEdgeFirstCharacter,
                this.activeLength + 1);
    }

    public ActivePoint moveToEdgeStartingWithAndByActiveLengthMinusOne(final Node node,
                                                                       final char character) {
        return new ActivePoint(node, character, this.activeLength - 1);
    }

    public ActivePoint moveToNextNodeOfActiveEdge(final String word, final int index) {
        return new ActivePoint(this.getActiveEdge().getNext(),
                word.charAt(index - this.activeLength + this.getActiveEdge().getLength()),
                this.activeLength - this.getActiveEdge().getLength());
    }

    public void addEdgeToActiveNode(final char character, final Edge edge) {
        this.activeNode.getEdges().put(character, edge);
    }

    public void splitActiveEdge(final String word,
                                final Node nodeToAdd,
                                final int index,
                                final char character) {
        final Edge activeEdgeToSplit = this.getActiveEdge();
        final Edge splittedEdge = new Edge(activeEdgeToSplit.getFrom(),
                activeEdgeToSplit.getFrom() + this.activeLength,
                nodeToAdd);
        nodeToAdd.getEdges().put(word.charAt(activeEdgeToSplit.getFrom() + this.activeLength),
                new Edge(activeEdgeToSplit.getFrom() + this.activeLength,
                        activeEdgeToSplit.getTo(),
                        activeEdgeToSplit.getNext()));
        nodeToAdd.getEdges().put(character, new Edge(index, word.length(), null));
        this.activeNode.getEdges().put(this.activeEdgeFirstCharacter, splittedEdge);
    }

    public Node setSlinkTo(final Node previouslyAddedNodeOrAddedEdgeNode,
                           final Node node) {
        if(previouslyAddedNodeOrAddedEdgeNode != null) {
            previouslyAddedNodeOrAddedEdgeNode.setSlink(node);
        }
        return node;
    }

    public Node setSlinkToActiveNode(final Node previouslyAddedNodeOrAddedEdgeNode) {
        return setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, this.activeNode);
    }
  }

  private static int idGenerator;

  private final String word;
  private final Node root;
  private ActivePoint activePoint;
  private int remainder;

  public ST(final String word) {
    this.word = word;
    this.root = new Node(idGenerator++);
    this.activePoint = new ActivePoint(this.root, null, 0);
    this.remainder = 0;
    build();
  }

  private void build() {
    for(int i = 0; i < this.word.length(); i++) {
        add(i, this.word.charAt(i));
    }
  }

  private void add(final int index, final char character) {
    this.remainder++;
    boolean characterFoundInTheTree = false;
    Node previouslyAddedNodeOrAddedEdgeNode = null;
    while(!characterFoundInTheTree && this.remainder > 0) {
        if(this.activePoint.pointsToActiveNode()) {
            if(this.activePoint.activeNodeHasEdgeStartingWith(character)) {
                activeNodeHasEdgeStartingWithCharacter(character, previouslyAddedNodeOrAddedEdgeNode);
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    rootNodeHasNotEdgeStartingWithCharacter(index, character);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = internalNodeHasNotEdgeStartingWithCharacter(index,
                            character, previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
        else {
            if(this.activePoint.pointsToOnActiveEdge(this.word, character)) {
                activeEdgeHasCharacter();
                characterFoundInTheTree = true;
            }
            else {
                if(this.activePoint.activeNodeIs(this.root)) {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromRootNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
                else {
                    previouslyAddedNodeOrAddedEdgeNode = edgeFromInternalNodeHasNotCharacter(index,
                            character,
                            previouslyAddedNodeOrAddedEdgeNode);
                }
            }
        }
    }
  }

  private void activeNodeHasEdgeStartingWithCharacter(final char character,
                                                    final Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByOne(character);
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private void rootNodeHasNotEdgeStartingWithCharacter(final int index, final char character) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    this.activePoint = this.activePoint.moveTo(this.root);
    this.remainder--;
    assert this.remainder == 0;
  }

  private Node internalNodeHasNotEdgeStartingWithCharacter(final int index,
                                                         final char character,
                                                         Node previouslyAddedNodeOrAddedEdgeNode) {
    this.activePoint.addEdgeToActiveNode(character, new Edge(index, this.word.length(), null));
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkToActiveNode(previouslyAddedNodeOrAddedEdgeNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private void activeEdgeHasCharacter() {
    this.activePoint = this.activePoint.moveByOneCharacter();
    if(this.activePoint.pointsToTheEndOfActiveEdge()) {
        this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
    }
  }

  private Node edgeFromRootNodeHasNotCharacter(final int index,
                                             final char character,
                                             Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    this.activePoint = this.activePoint.moveToEdgeStartingWithAndByActiveLengthMinusOne(this.root,
            this.word.charAt(index - this.remainder + 2));
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private Node edgeFromInternalNodeHasNotCharacter(final int index,
                                                 final char character,
                                                 Node previouslyAddedNodeOrAddedEdgeNode) {
    final Node newNode = new Node(idGenerator++);
    this.activePoint.splitActiveEdge(this.word, newNode, index, character);
    previouslyAddedNodeOrAddedEdgeNode = this.activePoint.setSlinkTo(previouslyAddedNodeOrAddedEdgeNode, newNode);
    if(this.activePoint.activeNodeHasSlink()) {
        this.activePoint = this.activePoint.moveToSlink();
    }
    else {
        this.activePoint = this.activePoint.moveTo(this.root);
    }
    this.activePoint = walkDown(index);
    this.remainder--;
    return previouslyAddedNodeOrAddedEdgeNode;
  }

  private ActivePoint walkDown(final int index) {
    while(!this.activePoint.pointsToActiveNode()
            && (this.activePoint.pointsToTheEndOfActiveEdge() || this.activePoint.pointsAfterTheEndOfActiveEdge())) {
        if(this.activePoint.pointsAfterTheEndOfActiveEdge()) {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge(this.word, index);
        }
        else {
            this.activePoint = this.activePoint.moveToNextNodeOfActiveEdge();
        }
    }
    return this.activePoint;
  }

  public String toString(final String word) {
    return this.root.toString(word);
  }

  public boolean contains(final String suffix) {
    return this.root.contains(this.word, suffix);
  }

  public static void main(final String[] args) {
    final String[] words = {
            "abcabcabc$",
            "abc$",
            "abcabxabcd$",
            "abcabxabda$",
            "abcabxad$",
            "aabaaabb$",
            "aababcabcd$",
            "ababcabcd$",
            "abccba$",
            "mississipi$",
            "abacabadabacabae$",
            "abcabcd$",
            "00132220$"
    };
    Arrays.stream(words).forEach(word -> {
        System.out.println("Building suffix tree for word: " + word);
        final ST suffixTree = new ST(word);
        System.out.println("Suffix tree: " + suffixTree.toString(word));
        for(int i = 0; i < word.length() - 1; i++) {
            assert suffixTree.contains(word.substring(i)) : word.substring(i);
        }
    });
  }
}

回答by Peter de Rivaz

My intuition is as follows:

我的直觉如下:

After k iterations of the main loop you have constructed a suffix tree which contains all suffixes of the complete string that start in the first k characters.

在主循环的 k 次迭代之后,您已经构建了一个后缀树,其中包含从前 k 个字符开始的完整字符串的所有后缀。

At the start, this means the suffix tree contains a single root node that represents the entire string (this is the only suffix that starts at 0).

一开始,这意味着后缀树包含一个代表整个字符串的根节点(这是唯一从 0 开始的后缀)。

After len(string) iterations you have a suffix tree that contains all suffixes.

在 len(string) 迭代之后,您有一个包含所有后缀的后缀树。

During the loop the key is the active point. My guess is that this represents the deepest point in the suffix tree that corresponds to a proper suffix of the first k characters of the string. (I think proper means that the suffix cannot be the entire string.)

在循环期间,关键是活动点。我的猜测是,这表示后缀树中最深的点,对应于字符串前 k 个字符的正确后缀。(我认为正确意味着后缀不能是整个字符串。)

For example, suppose you have seen characters 'abcabc'. The active point would represent the point in the tree corresponding to the suffix 'abc'.

例如,假设您看到了字符“abcabc”。活动点将代表树中与后缀“abc”相对应的点。

The active point is represented by (origin,first,last). This means that you are currently at the point in the tree that you get to by starting at node origin and then feeding in the characters in string[first:last]

活动点由 (origin,first,last) 表示。这意味着您当前位于通过从节点原点开始然后输入 string[first:last] 中的字符而到达的树中的点

When you add a new character you look to see whether the active point is still in the existing tree. If it is then you are done. Otherwise you need to add a new node to the suffix tree at the active point, fallback to the next shortest match, and check again.

添加新角色时,您会查看活动点是否仍在现有树中。如果是,那么你就完成了。否则,您需要在活动点的后缀树中添加一个新节点,回退到下一个最短匹配,然后再次检查。

Note 1: The suffix pointers give a link to the next shortest match for each node.

注 1:后缀指针给出了每个节点下一个最短匹配的链接。

Note 2: When you add a new node and fallback you add a new suffix pointer for the new node. The destination for this suffix pointer will be the node at the shortened active point. This node will either already exist, or be created on the next iteration of this fallback loop.

注意 2:当您添加新节点和回退时,您会为新节点添加一个新的后缀指针。此后缀指针的目的地将是缩短的活动点处的节点。该节点要么已经存在,要么在此回退循环的下一次迭代中创建。

Note 3: The canonization part simply saves time in checking the active point. For example, suppose you always used origin=0, and just changed first and last. To check the active point you would have to follow the suffix tree each time along all the intermediate nodes. It makes sense to cache the result of following this path by recording just the distance from the last node.

注 3:规范化部分只是节省了检查活动点的时间。例如,假设您始终使用 origin=0,并且只是更改了 first 和 last。要检查活动点,您必须每次沿着所有中间节点跟踪后缀树。通过仅记录与最后一个节点的距离来缓存遵循此路径的结果是有意义的。

Can you give a code example of what you mean by "fix" bounding variables?

你能举一个代码示例来说明“修复”边界变量的含义吗?

Health warning: I also found this algorithm particularly hard to understand so please realise that this intuition is likely to be incorrect in all important details...

健康警告:我还发现这个算法特别难以理解,所以请注意,这种直觉很可能在所有重要细节上都不正确......

回答by Suchit Puri

Hi i have tried to implement the above explained implementation in ruby , please check it out. it seems to work fine.

嗨,我已经尝试在 ruby​​ 中实现上面解释的实现,请检查一下。它似乎工作正常。

the only difference in the implementation is that , i have tried to use the edge object instead of just using symbols.

实现中的唯一区别是,我尝试使用边缘对象而不是仅使用符号。

its also present at https://gist.github.com/suchitpuri/9304856

它也存在于https://gist.github.com/suchitpuri/9304856

    require 'pry'


class Edge
    attr_accessor :data , :edges , :suffix_link
    def initialize data
        @data = data
        @edges = []
        @suffix_link = nil
    end

    def find_edge element
        self.edges.each do |edge|
            return edge if edge.data.start_with? element
        end
        return nil
    end
end

class SuffixTrees
    attr_accessor :root , :active_point , :remainder , :pending_prefixes , :last_split_edge , :remainder

    def initialize
        @root = Edge.new nil
        @active_point = { active_node: @root , active_edge: nil , active_length: 0}
        @remainder = 0
        @pending_prefixes = []
        @last_split_edge = nil
        @remainder = 1
    end

    def build string
        string.split("").each_with_index do |element , index|


            add_to_edges @root , element        

            update_pending_prefix element                           
            add_pending_elements_to_tree element
            active_length = @active_point[:active_length]

            # if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data[0..active_length-1] ==  @active_point[:active_edge].data[active_length..@active_point[:active_edge].data.length-1])
            #   @active_point[:active_edge].data = @active_point[:active_edge].data[0..active_length-1]
            #   @active_point[:active_edge].edges << Edge.new(@active_point[:active_edge].data)
            # end

            if(@active_point[:active_edge] && @active_point[:active_edge].data && @active_point[:active_edge].data.length == @active_point[:active_length]  )
                @active_point[:active_node] =  @active_point[:active_edge]
                @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0])
                @active_point[:active_length] = 0
            end
        end
    end

    def add_pending_elements_to_tree element

        to_be_deleted = []
        update_active_length = false
        # binding.pry
        if( @active_point[:active_node].find_edge(element[0]) != nil)
            @active_point[:active_length] = @active_point[:active_length] + 1               
            @active_point[:active_edge] = @active_point[:active_node].find_edge(element[0]) if @active_point[:active_edge] == nil
            @remainder = @remainder + 1
            return
        end



        @pending_prefixes.each_with_index do |pending_prefix , index|

            # binding.pry           

            if @active_point[:active_edge] == nil and @active_point[:active_node].find_edge(element[0]) == nil

                @active_point[:active_node].edges << Edge.new(element)

            else

                @active_point[:active_edge] = node.find_edge(element[0]) if @active_point[:active_edge]  == nil

                data = @active_point[:active_edge].data
                data = data.split("")               

                location = @active_point[:active_length]


                # binding.pry
                if(data[0..location].join == pending_prefix or @active_point[:active_node].find_edge(element) != nil )                  


                else #tree split    
                    split_edge data , index , element
                end

            end
        end 
    end



    def update_pending_prefix element
        if @active_point[:active_edge] == nil
            @pending_prefixes = [element]
            return

        end

        @pending_prefixes = []

        length = @active_point[:active_edge].data.length
        data = @active_point[:active_edge].data
        @remainder.times do |ctr|
                @pending_prefixes << data[-(ctr+1)..data.length-1]
        end

        @pending_prefixes.reverse!

    end

    def split_edge data , index , element
        location = @active_point[:active_length]
        old_edges = []
        internal_node = (@active_point[:active_edge].edges != nil)

        if (internal_node)
            old_edges = @active_point[:active_edge].edges 
            @active_point[:active_edge].edges = []
        end

        @active_point[:active_edge].data = data[0..location-1].join                 
        @active_point[:active_edge].edges << Edge.new(data[location..data.size].join)


        if internal_node
            @active_point[:active_edge].edges << Edge.new(element)
        else
            @active_point[:active_edge].edges << Edge.new(data.last)        
        end

        if internal_node
            @active_point[:active_edge].edges[0].edges = old_edges
        end


        #setup the suffix link
        if @last_split_edge != nil and @last_split_edge.data.end_with?@active_point[:active_edge].data 

            @last_split_edge.suffix_link = @active_point[:active_edge] 
        end

        @last_split_edge = @active_point[:active_edge]

        update_active_point index

    end


    def update_active_point index
        if(@active_point[:active_node] == @root)
            @active_point[:active_length] = @active_point[:active_length] - 1
            @remainder = @remainder - 1
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@pending_prefixes.first[index+1])
        else
            if @active_point[:active_node].suffix_link != nil
                @active_point[:active_node] = @active_point[:active_node].suffix_link               
            else
                @active_point[:active_node] = @root
            end 
            @active_point[:active_edge] = @active_point[:active_node].find_edge(@active_point[:active_edge].data[0])
            @remainder = @remainder - 1     
        end
    end

    def add_to_edges root , element     
        return if root == nil
        root.data = root.data + element if(root.data and root.edges.size == 0)
        root.edges.each do |edge|
            add_to_edges edge , element
        end
    end
end

suffix_tree = SuffixTrees.new
suffix_tree.build("abcabxabcd")
binding.pry