C++ 后缀数组算法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17761704/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Suffix Array Algorithm
提问by Spandan
After quite a bit of reading, I have figured out what a suffix array and LCP array represents.
经过大量阅读,我已经弄清楚后缀数组和 LCP 数组代表什么。
Suffix array: Represents the _lexicographic rank of each suffix of an array.
后缀数组:表示数组每个后缀的_lexicographic rank。
LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
LCP 数组:包含两个连续后缀之间的最大长度前缀匹配,按字典序排序后。
I have been trying hard to understand since a couple of days , how exactly the suffix array and LCP algorithm works.
几天以来我一直在努力理解后缀数组和 LCP 算法究竟是如何工作的。
Here is the code , which is taken from Codeforces:
这是从Codeforces 中获取的代码:
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>
using namespace std;
#define REP(i, n) for (int i = 0; i < (int)(n); ++i)
namespace SuffixArray
{
const int MAXN = 1 << 21;
char * S;
int N, gap;
int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];
bool sufCmp(int i, int j)
{
if (pos[i] != pos[j])
return pos[i] < pos[j];
i += gap;
j += gap;
return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
void buildSA()
{
N = strlen(S);
REP(i, N) sa[i] = i, pos[i] = S[i];
for (gap = 1;; gap *= 2)
{
sort(sa, sa + N, sufCmp);
REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
REP(i, N) pos[sa[i]] = tmp[i];
if (tmp[N - 1] == N - 1) break;
}
}
void buildLCP()
{
for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
{
for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
++k;
lcp[pos[i]] = k;
if (k)--k;
}
}
} // end namespace SuffixArray
I cannot, just cannot get through how this algorithm works. I tried working on an example using pencil and paper, and wrote through the steps involved, but lost link in between as its too complicated, for me at least.
我不能,就是不能理解这个算法是如何工作的。我试着用铅笔和纸做一个例子,并写下所涉及的步骤,但由于太复杂,至少对我来说,这中间失去了联系。
Any help regarding explanation, using an example maybe, is highly appreciated.
任何关于解释的帮助,使用一个例子,都非常感谢。
回答by jogojapan
Overview
概述
This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if instead of ::sort
a 2-pass bucket sort had been used).
这是用于后缀数组构造的 O(n log n) 算法(或者更确切地说,如果使用::sort
了 2-pass 桶排序,则应该如此)。
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S
, so in the i-th iteration, we sort the 2i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2i-grams in the i-th step is facilitated by making sure that each comparison of two 2i-grams is done in O(1) time (rather than O(2i) time).
它的工作原理是首先对原始字符串的 2-gram (*) 进行排序,然后是 4-gram,然后是 8-gram,依此类推S
,因此在第 i 次迭代中,我们对 2 i-gram 进行排序。有可明显不超过日志2(n)的这样的迭代,并且诀窍是,排序2我-grams中的第i个步骤是通过确保两个2每次比较容易我-grams是为O完成(1) 时间(而不是 O(2 i) 时间)。
How does it do this? Well, in the first iterationit sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n
) that stores, for each bigram, its rankin the bigram sorting.
它是如何做到这一点的?好吧,在第一次迭代中,它对 2-grams(又名 bigrams)进行排序,然后执行所谓的lexicographic renaming。这意味着它会创建一个新数组(长度为n
),用于存储每个二元组在二元组排序中的排名。
Example for lexicographic renaming:Say we have a sortedlist of some bigrams {'ab','ab','ca','cd','cd','ea'}
. We then assign ranks(i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a newbigram changes. So the ranks we assign are as follows:
字典重命名示例:假设我们有一些 bigrams的排序列表{'ab','ab','ca','cd','cd','ea'}
。然后我们通过从左到右分配等级(即词典名称),从等级 0 开始,并在遇到新的二元组变化时增加等级。所以我们分配的等级如下:
ab : 0
ab : 0 [no change to previous]
ca : 1 [increment because different from previous]
cd : 2 [increment because different from previous]
cd : 2 [no change to previous]
ea : 3 [increment because different from previous]
These ranks are known as lexicographic names.
这些等级被称为词典名称。
Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking upthe ranks of the two bigrams contained in them, using the rank table generated in the previous steps. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram, its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater somewhere in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-upsin the rank table are sufficient to compare all 4 characters of the two 4-grams.
现在,在下一次迭代中,我们对 4-gram 进行排序。这涉及到不同 4-gram 之间的大量比较。我们如何比较两个 4-gram?好吧,我们可以逐个比较它们。每次比较最多需要 4 次操作。但是相反,我们通过比较它们查找包含在其中两个二元语法的行列,用在前面的步骤中产生的排名表。该等级表示前一个 2-gram 排序的词典排序,因此如果对于任何给定的 4-gram,它的第一个 2-gram 的等级高于另一个 4-gram 的第一个 2-gram,那么它必须在词典顺序上更大在前两个字符的某个地方。因此,如果两个 4-gram 的第一个 2-gram 的秩相同,则它们在前两个字符。换句话说,在等级表中的两次查找足以比较两个 4-gram 的所有 4 个字符。
After sorting, we create new lexicographic names again, this time for the 4-grams.
排序后,我们再次创建新的词典名称,这次是 4-gram。
In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.
在第三次迭代中,我们需要按 8-gram 进行排序。同样,在上一步的词典排序表中的两次查找足以比较两个给定的 8-gram 的所有 8 个字符。
And so forth. Each iteration i
has two steps:
等等。每次迭代i
有两个步骤:
Sorting by 2i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
按 2 个i-gram排序,使用来自前一次迭代的字典名称,以每个步骤分 2 步(即 O(1) 时间)进行比较
创建新的字典名称
We repeat this until all 2i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is the same as n-1
, then each 2i-gram must have been given its own, distinct lexicographic name.
我们重复这个直到所有 2 i-grams 都不同。如果发生这种情况,我们就完成了。我们如何知道是否一切都不同?嗯,词典名称是一个递增的整数序列,从 0 开始。因此,如果在迭代中生成的最高词典名称与 相同n-1
,那么每个 2 i-gram 必须被赋予其自己的、不同的词典名称。
Implementation
执行
Now let's look at the code to confirm all of this. The variables used are as follows: sa[]
is the suffix array we are building. pos[]
is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k]
contains the lexicographic name of the k
-th m-gram of the previous step. tmp[]
is an auxiliary array used to help create pos[]
.
现在让我们看看代码来确认这一切。使用的变量如下:sa[]
是我们正在构建的后缀数组。pos[]
是排名查找表(即它包含词典名称),具体而言,pos[k]
包含k
上一步的第 -th 个 m-gram的词典名称。tmp[]
是一个辅助数组,用于帮助创建pos[]
.
I'll give further explanations between the code lines:
我将在代码行之间给出进一步的解释:
void buildSA()
{
N = strlen(S);
/* This is a loop that initializes sa[] and pos[].
For sa[] we assume the order the suffixes have
in the given string. For pos[] we set the lexicographic
rank of each 1-gram using the characters themselves.
That makes sense, right? */
REP(i, N) sa[i] = i, pos[i] = S[i];
/* Gap is the length of the m-gram in each step, divided by 2.
We start with 2-grams, so gap is 1 initially. It then increases
to 2, 4, 8 and so on. */
for (gap = 1;; gap *= 2)
{
/* We sort by (gap*2)-grams: */
sort(sa, sa + N, sufCmp);
/* We compute the lexicographic rank of each m-gram
that we have sorted above. Notice how the rank is computed
by comparing each n-gram at position i with its
neighbor at i+1. If they are identical, the comparison
yields 0, so the rank does not increase. Otherwise the
comparison yields 1, so the rank increases by 1. */
REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
/* tmp contains the rank by position. Now we map this
into pos, so that in the next step we can look it
up per m-gram, rather than by position. */
REP(i, N) pos[sa[i]] = tmp[i];
/* If the largest lexicographic name generated is
n-1, we are finished, because this means all
m-grams must have been different. */
if (tmp[N - 1] == N - 1) break;
}
}
About the comparison function
关于比较功能
The function sufCmp
is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap
, which is a global variable.
该函数sufCmp
用于按字典顺序比较两个 (2*gap)-gram。所以在第一次迭代中它比较二元组,在第二次迭代中比较 4-grams,然后是 8-grams,依此类推。这是由 控制的gap
,它是一个全局变量。
A naive implementation of sufCmp
would be this:
一个天真的实现sufCmp
是这样的:
bool sufCmp(int i, int j)
{
int pos_i = sa[i];
int pos_j = sa[j];
int end_i = pos_i + 2*gap;
int end_j = pos_j + 2*gap;
if (end_i > N)
end_i = N;
if (end_j > N)
end_j = N;
while (i < end_i && j < end_j)
{
if (S[pos_i] != S[pos_j])
return S[pos_i] < S[pos_j];
pos_i += 1;
pos_j += 1;
}
return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i]
with the one found at the beginning of the j-th suffix pos_j:=sa[j]
. And it would compare them character by character, i.e. comparing S[pos_i]
with S[pos_j]
, then S[pos_i+1]
with S[pos_j+1]
and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b
in a function returning int
means you return 1 if the condition is true, and 0 if it is false.)
这将在开始时比较(2 *间隙)-gram第i个后缀pos_i:=sa[i]
与所述一个发现在第j个后缀的开始pos_j:=sa[j]
。并且它会逐个字符地比较它们,即比较S[pos_i]
与S[pos_j]
,然后S[pos_i+1]
与S[pos_j+1]
等等。只要字符相同,它就会继续。一旦它们不同,如果第 i 个后缀中的字符小于第 j 个后缀中的字符,则返回 1,否则返回 0。(请注意,return a<b
在函数中返回int
意味着如果条件为真则返回 1,如果条件为假则返回 0。)
The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i
or pos_j
will reach N
before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorterone is lexicographically smaller. If pos_i
has reached the end, the i-th suffix must be shorter than the j-th suffix.
return 语句中的复杂查找条件处理 (2*gap)-gram 之一位于字符串末尾的情况。在这种情况下,pos_i
或者pos_j
将N
在所有 (2*gap) 字符被比较之前到达,即使到该点的所有字符都相同。如果第 i 个后缀在末尾,则返回 1,如果第 j 个后缀在末尾,则返回 0。这是正确的,因为如果所有字符都相同,则较短的字符在字典上更小。如果pos_i
已经到了结尾,第 i 个后缀必须比第 j 个后缀短。
Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
显然,这个简单的实现是 O(gap),即它的复杂性在 (2*gap)-gram 的长度上是线性的。但是,您的代码中使用的函数使用字典名称将其降低到 O(1)(具体而言,最多降低到两次比较):
bool sufCmp(int i, int j)
{
if (pos[i] != pos[j])
return pos[i] < pos[j];
i += gap;
j += gap;
return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i]
and S[j]
, we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j]
, then the i-th suffix sa[i]
must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]
. In other words, simply by looking up pos[i]
and pos[j]
and comparing them, we have compared the first gapcharacters of the two suffixes.
如您所见,我们不是查找单个字符S[i]
and S[j]
,而是检查第 i 个和第 j 个后缀的字典顺序。在上一次迭代中计算了gap-grams的字典排序。所以,如果pos[i] < pos[j]
,那么第 i 个后缀sa[i]
必须以一个在字典上比 开头的 gap-gram 小的 gap-gram 开始sa[j]
。换句话说,只需通过查找pos[i]
并pos[j]
和他们比较,我们比较了第一个缺口的两个后缀的字符。
If the ranks are identical, we continue by comparing pos[i+gap]
with pos[j+gap]
. This is the same as comparing the next gapcharacters of the (2*gap)-grams, i.e. the second half. If the ranks are indentical again, the two (2*gap)-grams are indentical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.
如果等级相同,我们将继续通过比较pos[i+gap]
同pos[j+gap]
。这与比较(2*gap)-gram的下一个间隙字符相同,即后半部分。如果秩再次相同,则两个 (2*gap)-gram 相同,因此我们返回 0。否则,如果第 i 个后缀小于第 j 个后缀,则返回 1,否则返回 0。
Example
例子
The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.
以下示例说明了该算法的运行方式,并特别说明了词典名称在排序算法中的作用。
The string we want to sort is abcxabcd
. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S
(the string), sa
(the current state of the suffix array) and tmp
and pos
, which represent the lexicographic names.
我们要排序的字符串是abcxabcd
。为此生成后缀数组需要 3 次迭代。在每次迭代中,我将显示S
(字符串)、sa
(后缀数组的当前状态)和tmp
和pos
,它们代表字典名称。
First, we initialize:
首先,我们初始化:
S abcxabcd
sa 01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.
请注意最初表示一元组的字典顺序的字典名称是如何与字符(即一元组)本身完全相同的。
First iteration:
第一次迭代:
Sorting sa
, using bigrams as sorting criterion:
排序sa
,使用二元组作为排序标准:
sa 04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'). then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.
前两个后缀是 0 和 4,因为它们是二元字母“ab”的位置。然后是 1 和 5(bigram 'bc' 的位置),然后是 6(bigram 'cd'),然后是 2(bigram 'cx')。然后是 7(不完整的二元词“d”),然后是 3(二元词“xa”)。显然,位置对应于顺序,完全基于字符二元组。
Generating the lexicographic names:
生成字典名称:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).
如上所述,词典名称被分配为递增的整数。前两个后缀(均以 bigram 'ab' 开头)为 0,接下来的两个后缀(均以 bigram 'bc' 开头)为 1,然后是 2、3、4、5(每个都是不同的 bigram)。
Finally, we map this according to the positions in sa
, to get pos
:
最后,我们根据 中的位置映射它sa
,得到pos
:
sa 04156273
tmp 00112345
pos 01350124
(The way pos
is generated is this: Go through sa
from left to right, and use the entry to define the index in pos
. Use the corresponding entry in tmp
to define the value for that index. So pos[0]:=0
, pos[4]:=0
, pos[1]:=1
, pos[5]:=1
, pos[6]:=2
, and so on. The index comes from sa
, the value from tmp
.)
(方式pos
产生是这样的:经过sa
从左至右,并使用项来定义索引pos
,使用在相应的条目tmp
。定义为指数值那么pos[0]:=0
,pos[4]:=0
,pos[1]:=1
,pos[5]:=1
,pos[6]:=2
,,等指数来自sa
,值来自tmp
。)
Second iteration:
第二次迭代:
We sort sa
again, and again we look at bigrams from pos
(which each represents a sequence of twobigrams of the original string).
我们sa
再次排序,并再次查看来自pos
(每个代表原始字符串的两个二元组的序列)的双元组。
sa 04516273
Notice how the position of 1 5 have switched compared to the previous version of sa
. It used to be 15, now it is 51. This is because the bigram at pos[1]
and the bigram at pos[5]
used to be identical (both bc
) in during the previous iteration, but now the bigram at pos[5]
is 12
, while the bigram at pos[1]
is 13
. So position 5
comes beforeposition 1
. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5]
represents bc
and pos[6]
represents 'cd'. So, together they represent bcd
, while pos[1]
represents bc
and pos[2]
represents cx
, so together they represent bcx
, which is indeed lexicographically greater than bcd
.
请注意 1 5 的位置与sa
. 以前是 15,现在是 51。这是因为在上一次迭代中,二元语法 atpos[1]
和二元语法 atpos[5]
曾经是相同的(两者都是bc
),但现在二元语法 atpos[5]
是12
,而二元语法 atpos[1]
是13
。所以 position5
出现在position之前1
。这是因为词典名称现在每个都代表原始字符串的二元组:pos[5]
代表bc
和pos[6]
代表 'cd'。所以,他们一起代表bcd
,而pos[1]
代表bc
和pos[2]
代表cx
,所以他们一起代表bcx
,这确实在字典序上大于bcd
。
Again, we generate lexicographic names by screening the current version of sa
from left to right and comparing the corrsponding bigrams in pos
:
同样,我们通过sa
从左到右筛选当前版本并比较 中的相应二元组来生成词典名称pos
:
tmp 00123456
The first two entries are still identical (both 0), because the corresponding bigrams in pos
are both 01
. The rest is an strictly increasing sequence of integers, because all other bigrams in pos
are each unique.
前两个条目仍然相同(都是 0),因为对应的双元组pos
都是01
。其余的是严格递增的整数序列,因为其中的所有其他二元组pos
都是唯一的。
We perform the mapping to the new pos
as before (taking indices from sa
and values from tmp
):
我们pos
像以前一样执行到新的映射(从 中获取索引sa
和值tmp
):
sa 04516273
tmp 00123456
pos 02460135
Third iteration:
第三次迭代:
We sort sa
again, taking bigrams of pos
(as always), which now each represents a sequence of 4 bigrams of the orginal string.
我们sa
再次排序,取pos
(一如既往)的二元组,现在每个表示原始字符串的 4 个二元组的序列。
sa 40516273
You'll notice that now the first two entries have switched positions: 04
has become 40
. This is because the bigram at pos[0]
is 02
while the one at pos[4]
is 01
, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx
and abcd
, respectively.
您会注意到,现在前两个条目已经交换了位置:04
已变为40
。这是因为二元组 at pos[0]
is02
而一个 atpos[4]
是01
,后者显然在字典序上更小。深层原因是这两个分别代表abcx
和abcd
。
Generating lexicographic names yields:
生成字典名称产生:
tmp 01234567
They are all different, i.e. the highest one is 7
, which is n-1
. So, we are done, because are sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.
它们都不同,即最高的是7
,即n-1
。所以,我们完成了,因为现在排序是基于所有不同的 m-gram。即使我们继续,排序顺序也不会改变。
Improvement suggestion
改进建议
The algorithm used to sort the 2i-grams in each iteration appears to be the built-in sort
(or std::sort
). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it a O(n (log n)2)-time algorithm. However, the sorting could by performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step), form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.
用于在每次迭代中对2 个i-gram进行排序的算法似乎是内置的sort
(或std::sort
)。这意味着它是一个比较排序,在最坏的情况下,在每次迭代中花费 O(n log n) 时间。由于在最坏的情况下有 log n 次迭代,这使其成为 O(n (log n) 2) 时间算法。然而,排序可以通过使用两次桶排序来执行,因为我们用于排序比较的键(即前一步的词典名称)形成一个递增的整数序列。因此,这可以改进为用于后缀排序的实际 O(n log n) 时间算法。
Remark
评论
I believe this is the original algorithm for suffix array construction that was suggested in the 1992-paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (at the same time, but independently of a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as pat arrays at the time) as a data structure interesting for further study.
我相信这是 Manber 和 Myers 在 1992 年的论文中建议的后缀数组构造的原始算法(Google Scholar 上的链接;它应该是第一个命中,它可能有指向 PDF 的链接)。这(同时,但独立于 Gonnet 和 Baeza-Yates 的一篇论文)引入了后缀数组(当时也称为 pat 数组)作为一种值得进一步研究的数据结构。
Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).
后缀数组构造的现代算法是 O(n),因此以上不再是可用的最佳算法(至少在理论上的最坏情况复杂度方面不是)。
Footnotes
脚注
(*)By 2-gramI mean a sequence of two consecutivecharacters of the original string. For example, when S=abcde
is the string, then ab
, bc
, cd
, de
are the 2-grams of S
. Similarly, abcd
and bcde
are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m
consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.
(*)通过2-克我的意思是两个序列连续原始字符串的字符。例如,whenS=abcde
是字符串,则ab
、bc
、cd
、de
是 的 2 克S
。类似地,abcd
和bcde
是 4 克。通常,m-gram(对于正整数 m)是一个m
连续字符序列。1-grams 也称为unigrams,2-grams 称为bigrams,3-grams 称为trigrams。有些人继续使用四边形、五边形等。
Note that the suffix of S
that starts and position i
, is an (n-i)-gram of S
. Also, every m-gram (for any m) is a prefix of one of the suffixes of S
. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.
请注意,S
开始和位置的后缀i
是 的 (ni)-gram S
。此外,每个 m-gram(对于任何 m)都是 的后缀之一的前缀S
。因此,对 m-gram 进行排序(对于尽可能大的 m)可能是对后缀进行排序的第一步。