C++ 字符串::查找复杂度

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/8869605/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 19:15:33  来源:igfitidea点击:

C++ string::find complexity

c++stringalgorithmsubstringtime-complexity

提问by Farzam

Why the c++'s implemented string::find()doesn't use the KMP algorithm(and doesn't run in O(N + M)) and runs in O(N * M)? Is that corrected in C++0x? If the complexity of current find is not O(N * M), what is that?

为什么 c++ 的实现string::find()不使用KMP 算法(并且不在 中运行O(N + M))而在 中运行O(N * M)?这在 C++0x 中得到纠正了吗?如果当前 find 的复杂性不是O(N * M),那是什么?

so what algorithm is implemented in gcc? is that KMP? if not, why? I've tested that and the running time shows that it runs in O(N * M)

那么gcc中实现了什么算法?那是KMP吗?如果不是,为什么?我已经测试过了,运行时间显示它在O(N * M)

回答by Mike Seymour

Why the c++'s implemented string::substr() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)?

为什么 c++ 实现的 string::substr() 不使用 KMP 算法(并且不在 O(N + M) 中运行)而在 O(N * M) 中运行?

I assume you mean find(), rather than substr()which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).

我假设你的意思是find(), 而不是substr()which 不需要搜索并且应该在线性时间内运行(并且只是因为它必须将结果复制到一个新字符串中)。

The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::stringoperations are that size(), max_size(), operator[], swap(), c_str()and data()are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.

C++标准没有规定实现细节,只规定了某些情况下的复杂性要求。上唯一的复杂性要求std::string操作是size()max_size()operator[]swap()c_str()data()都是恒定的时间。其他任何事情的复杂性取决于实现您正在使用的库的人所做的选择。

The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.

选择简单搜索而不是 KMP 之类的最可能的原因是避免需要额外的存储空间。除非要查找的字符串很长,并且要搜索的字符串包含很多部分匹配项,否则分配和释放所花费的时间可能会远远超过额外复杂性的成本。

Is that corrected in c++0x?

这在c ++ 0x中更正了吗?

No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.

不,C++11 没有向 增加任何复杂性要求std::string,当然也没有增加任何强制性的实现细节。

If the complexity of current substr is not O(N * M), what is that?

如果当前 substr 的复杂度不是 O(N * M),那是什么?

That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.

这是最坏情况的复杂性,当要搜索的字符串包含大量长部分匹配时。如果字符具有合理的均匀分布,则平均复杂度将接近O(N)。因此,通过选择具有更好最坏情况复杂度的算法,您很可能会使更典型的情况变慢。

回答by Dietmar Kühl

Where do you get the impression from that std::string::substr()doesn't use a linear algorithm? In fact, I can't even imagine how to implement in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you are think this function does something else than it does? std::string::substr()just creates a new string starting at its first argument and using either the number of characters specified by the second parameter or the characters up to the end of the string.

你从哪里得到std::string::substr()不使用线性算法的印象?事实上,我什至无法想象如何以具有您引用的复杂性的方式实现。此外,没有涉及太多算法:您是否有可能认为该函数除此之外还做了其他事情?std::string::substr()只需从第一个参数开始创建一个新字符串,并使用第二个参数指定的字符数或直到字符串末尾的字符。

You may be referring to std::string::find()which doesn't have any complexity requirements or std::search()which is indeed allowed to do O(n * m) comparisons. However, this is a giving implementers the freedom to choose between an algorithm which has the best theoretical complexity vs. one which doesn't doesn't need additional memory. Since allocation of arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.

您可能指的是std::string::find()哪个没有任何复杂性要求,或者std::search()哪个确实允许进行 O(n * m) 比较。然而,这让实现者可以自由地在具有最佳理论复杂度的算法与不需要额外内存的算法之间进行选择。由于除非特别要求,否则通常不希望分配任意数量的内存,因此这似乎是合理的做法。

回答by A. K.

FYI, The string::find in both gcc/libstdc++ and llvm/libcxx were very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementation:

仅供参考,gcc/libstdc++ 和 llvm/libcxx 中的 string::find 非常慢。我非常显着地改进了它们(在某些情况下提高了约 20 倍)。您可能想要检查新的实现:

GCC: PR66414 optimize std::string::find https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65

GCC:PR66414 优化 std::string::find https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65

LLVM: https://reviews.llvm.org/D27068

LLVM:https: //reviews.llvm.org/D27068

The new algorithm is simpler and uses hand optimized assembly functions of memchr, and memcmp.

新算法更简单,并使用了 memchr 和 memcmp 的手工优化汇编函数。

回答by Yola

Let's look into the CLRS book. On the page 989 of third edition we have the following exercise:

让我们看看 CLRS 的书。在第三版的第 989 页上,我们有以下练习:

Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Ʃd= {0; 1; ...; d}, where d >= 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is enter image description here
over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

假设模式 P 和文本 T 分别是从 d 元字母表 Ʃ d= {0;中随机选择的长度为 m 和 n 的字符串。1; ...; d},其中 d >= 2。表明在 该循环的所有执行过程中,由朴素算法第 4 行中的隐式循环进行的字符到字符比较的预期次数 。(假设朴素算法在发现不匹配或匹配整个模式后停止比较给定移位的字符。)因此,对于随机选择的字符串,朴素算法非常有效enter image description here

NAIVE-STRING-MATCHER(T,P)
1 n = T:length
2 m = P:length
3 for s = 0 to n - m
4     if P[1..m] == T[s+1..s+m]
5         print “Pattern occurs with shift” s

Proof:

证明:

For a single shift we are expected to perform 1 + 1/d + ... + 1/d^{m-1}comparisons. Now use summation formula and multiply by number of valid shifts, which is n - m + 1. □

对于单个班次,我们需要进行1 + 1/d + ... + 1/d^{m-1}比较。现在使用求和公式并乘以有效班次的次数,即n - m + 1。□

回答by paxdiablo

The C++ standard does not dictate the performance characteristics of substr(or many other parts, including the findyou're most likely referring to with an M*Ncomplexity).

C++ 标准没有规定substr(或许多其他部分,包括find您最有可能提到的M*N复杂性)的性能特征。

It mostly dictates functionalaspects of the language (with some exceptions like the non-legacy sortfunctions, for example).

它主要规定了语言的功能方面(例如,有一些例外,例如非遗留sort函数)。

Implementations are even free to implement qsortas a bubble sort (but only if they want to be ridiculed and possibly go out of business).

实现甚至可以自由地实现qsort为冒泡排序(但前提是它们想要被嘲笑并可能倒闭)。

For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::findof C++11, and noneof them specify performance parameters.

例如,21.4.7.2 basic_string::findC++11 的部分只有七个(非常小的)子点,并且它们都没有指定性能参数。

回答by Borodin

Where do you get your information about the C++ library? If you do mean string::searchand it really doesn't use the KMP algorithm then I suggest that it is because that algorithm isn't generally faster than a simple linear search due to having to build a partial match table before the search can proceed.

您从哪里获得有关 C++ 库的信息?如果您的意思是string::search并且它确实不使用 KMP 算法,那么我建议这是因为该算法通常不会比简单的线性搜索快,因为必须在搜索进行之前构建部分匹配表。

回答by Nathan Blackerby

If you are going to be searching for the same pattern in multiple texts. The BoyerMoore algorithm is a good choice because the pattern tables need only be computed once , but are used multiple times when searching multiple texts. If you are only going to search for a pattern once in 1 text though, the overhead of computing the tables along with the overhead of allocating memory slows you down too much and std::string.find(....) will beat you since it does not allocate any memory and has no overhead. Boost has multiple string searching algorithms. I found that BM was an order of magnitude slower in the case of a single pattern search in 1 text than std::string.find(). For my cases BoyerMoore was rarely faster than std::string.find() even when searching multiple texts with the same pattern. Here is the link to BoyerMoore BoyerMoore

如果您要在多个文本中搜索相同的模式。BoyerMoore 算法是一个不错的选择,因为模式表只需要计算一次,但在搜索多个文本时会多次使用。但是,如果您只想在 1 个文本中搜索一次模式,那么计算表的开销以及分配内存的开销会使您的速度减慢太多,而 std::string.find(....) 会打败您因为它不分配任何内存并且没有开销。Boost 有多种字符串搜索算法。我发现在 1 个文本中进行单一模式搜索时,BM 比 std::string.find() 慢一个数量级。对于我的情况,即使在搜索具有相同模式的多个文本时,BoyerMoore 也很少比 std::string.find() 快。这是博耶摩尔的链接 博耶摩尔