Note: this page is a mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/3183582/
What is the fastest substring search algorithm?
Asked by R.. GitHub STOP HELPING ICE
OK, so I don't sound like an idiot I'm going to state the problem/requirements more explicitly:
- Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
- Function should return a pointer to the first match, or NULL if no match is found.
- Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
- Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
...as well as what I mean by "fastest":
- Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results.)
- Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
- Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
- Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
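For concreteness, here is a minimal sketch of the naive brute-force baseline that the criteria above measure against. The function name substr_search is mine, not from any library; it operates on null-terminated strings per the requirements:

#include <stddef.h>

char *substr_search(const char *haystack, const char *needle)
{
    size_t i, j;
    if (!needle[0]) return (char *)haystack;           /* empty needle matches at offset 0 */
    for (i = 0; haystack[i]; i++) {
        for (j = 0; needle[j] && haystack[i + j] == needle[j]; j++)
            ;
        if (!needle[j]) return (char *)(haystack + i); /* whole needle matched */
        if (!haystack[i + j]) break;                   /* haystack exhausted mid-needle */
    }
    return NULL;
}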
My current implementation runs roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.
Update: My current optimal algorithm is as follows:
- For needles of length 1, use strchr.
- For needles of length 2-4, use machine words to compare 2-4 bytes at once, as follows: preload the needle in a 16- or 32-bit integer with bitshifts, and cycle the old byte out/new byte in from the haystack at each iteration (a sketch follows below). Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
- For needles of length >4, use the Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1 KB table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
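To make the length-2-4 case concrete, here is a minimal sketch of the rolling-word idea for a needle of exactly 2 bytes. This is an illustrative reimplementation under my own naming, not the author's actual code; the 3- and 4-byte cases pack a wider word the same way:

#include <stdint.h>
#include <stddef.h>

/* Rolling-word search for a needle of exactly 2 non-NUL bytes: keep the
 * last two haystack bytes packed in a 16-bit integer and compare it
 * against the packed needle at each position. */
static char *twobyte_search(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *)haystack;
    const unsigned char *n = (const unsigned char *)needle;
    uint16_t nw, hw;

    if (!h[0] || !h[1]) return NULL;  /* haystack shorter than the needle */
    nw = n[0] << 8 | n[1];            /* packed needle */
    hw = h[0] << 8 | h[1];            /* rolling two-byte window */

    /* Each haystack byte is read once: one end-of-string check plus one
     * 16-bit comparison per position, as described above. */
    for (h += 2; hw != nw; h++) {
        if (!*h) return NULL;
        hw = hw << 8 | *h;            /* old byte shifts out, new byte in */
    }
    return (char *)(h - 2);
}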
The big questions left in my mind are:
- Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
- The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the O(m)-in-space algorithms (where m is needle length) could be used for m < 100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.
Bonus points for:
- Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Answered by drawnonward
Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.
Boyer-Moore uses a bad character table with a good suffix table.
Boyer-Moore-Horspool uses a bad character table.
Knuth-Morris-Pratt uses a partial match table.
Rabin-Karp uses running hashes.
They all trade overhead for reduced comparisons to different degrees, so real-world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.
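To make the table mechanics concrete, here is a minimal Boyer-Moore-Horspool sketch using only the bad-character table (an illustrative, untuned implementation; the function name is mine):

#include <string.h>

/* Boyer-Moore-Horspool: shift the window by how far the window's last
 * byte is from its rightmost occurrence in the needle (bad-character rule). */
char *horspool_search(const char *haystack, const char *needle)
{
    size_t n = strlen(haystack), m = strlen(needle), shift[256], i;

    if (m == 0) return (char *)haystack;
    if (m > n)  return NULL;

    for (i = 0; i < 256; i++) shift[i] = m;          /* absent byte: full shift */
    for (i = 0; i + 1 < m; i++)                      /* every byte but the last */
        shift[(unsigned char)needle[i]] = m - 1 - i;

    for (i = 0; i + m <= n; i += shift[(unsigned char)haystack[i + m - 1]])
        if (memcmp(haystack + i, needle, m) == 0)
            return (char *)(haystack + i);

    return NULL;
}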
Edit:
A different algorithm might be best for finding base pairs, english phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.
Think about the following little table. Each question mark might have a different best search algorithm.
                  short needle    long needle
short haystack         ?               ?
long haystack          ?               ?
This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.
If I needed a sample set, I think I would scrape a site like google or wikipedia, then strip the html from all the result pages. For a search site, type in a word then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.
I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.
Answered by user541686
Published in 2011, I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi.
Update:
In 2014 the authors published this improvement: Towards optimal packed string matching.
Answered by NealB
The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.
Solutions to most search problems involve trade-offs with respect to pre-processing overhead, time, and space requirements. No single algorithm will be optimal or practical in all cases.
If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string-searching service routine, then try the following:
Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front-end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades badly elsewhere. For example, brute force is probably the best for needles of length 1, but quickly degrades as needle length increases, whereupon the Sustik-Moore algorithm may become more efficient (over small alphabets); then, for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.
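A minimal sketch of such a front-end selector, dispatching on needle length. The thresholds and helper routines here are hypothetical; both helpers fall back to strstr so the sketch compiles standalone:

#include <string.h>

/* Hypothetical helpers; in a real library each would be a separately
 * tuned routine chosen by benchmarking. Both fall back to strstr here
 * so this sketch compiles standalone. */
static char *short_needle_search(const char *h, const char *n) { return strstr(h, n); }
static char *long_needle_search(const char *h, const char *n)  { return strstr(h, n); }

char *search_select(const char *haystack, const char *needle)
{
    size_t m = strlen(needle);                       /* classifier input */
    if (m == 0) return (char *)haystack;             /* empty needle matches at start */
    if (m == 1) return strchr(haystack, needle[0]);  /* brute force / strchr wins */
    if (m < 8)  return short_needle_search(haystack, needle);
    return long_needle_search(haystack, needle);     /* preprocessing pays off */
}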
The multiple-algorithm approach is not a new idea. I believe it has been employed by a few commercial sort/search packages (e.g., SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).
Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.
Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.
Answered by Matyas
I was surprised to see our tech report cited in this discussion; I am one of the authors of the algorithm that was named Sustik-Moore above. (We did not use that term in our paper.)
I wanted here to emphasize that for me the most interesting feature of the algorithm is that it is quite simple to prove that each letter is examined at most once. For earlier Boyer-Moore versions they proved that each letter is examined at most 3 and later 2 times at most, and those proofs were more involved (see cites in paper). Therefore I also see a didactical value in presenting/studying this variant.
In the paper we also describe further variations that are geared toward efficiency while relaxing the theoretical guarantees. It is a short paper and the material should be understandable to an average high school graduate in my opinion.
Our main goal was to bring this version to the attention of others who can further improve on it. String searching has so many variations and we alone cannot possibly think of all where this idea could bring benefits. (Fixed text and changing pattern, fixed pattern different text, preprocessing possible/not possible, parallel execution, finding matching subsets in large texts, allow errors, near matches etc., etc.)
Answered by JDiMatteo
The fastest substring search algorithm is going to depend on the context:
- the alphabet size (e.g. DNA vs English)
- the needle length
The 2010 paper "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" gives tables with runtimes for 51 algorithms (with different alphabet sizes and needle lengths), so you can pick the best algorithm for your context.
All of those algorithms have C implementations, as well as a test suite, here:
Answered by user172818
A really good question. Just add some tiny bits...
Someone was talking about DNA sequence matching. But for DNA sequences, what we usually do is build a data structure (e.g. suffix array, suffix tree or FM-index) for the haystack and match many needles against it. This is a different question.
It would be really great if someone would like to benchmark various algorithms. There are very good benchmarks on compression and the construction of suffix arrays, but I have not seen a benchmark on string matching. Potential haystack candidates could be from the SACA benchmark.
A few days ago I was testing the Boyer-Moore implementation from the page you recommended (EDIT: I need a function call like memmem(), but it is not a standard function, so I decided to implement it). My benchmarking program uses random haystacks. It seems that the Boyer-Moore implementation on that page is several times faster than glibc's memmem() and Mac's strnstr(). In case you are interested, the implementation is here and the benchmarking code is here. This is definitely not a realistic benchmark, but it is a start.
Answered by Timothy Jones
I know it's an old question, but most bad shift tables are single-character. If it makes sense for your dataset (e.g. especially if it's written words), and if you have the space available, you can get a dramatic speedup by using a bad shift table made of n-grams rather than single characters.
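A sketch of one way to realize the n-gram idea: a Horspool-style search whose bad-shift table is indexed by the last two bytes of the window (a 64 KB table, which is the space trade-off mentioned). This is illustrative and untuned, and assumes the needle is at least 2 bytes:

#include <string.h>

/* Horspool-style search with a bigram (2-byte) bad-shift table: the shift
 * is looked up from the last TWO bytes of the window. 64 K entries of one
 * byte each; initializing them per call is itself a cost to amortize. */
char *horspool_bigram(const char *haystack, const char *needle)
{
    static unsigned char shift[65536];   /* the space trade-off */
    size_t n = strlen(haystack), m = strlen(needle), i;
    unsigned pair;

    if (m < 2 || m > n) return NULL;

    /* Default: a bigram absent from the needle lets the window advance m-1
     * (aligning the window's last byte with needle[0] is the nearest
     * alignment that still needs checking). Clamped to the entry width;
     * truncation only ever shrinks a shift, which stays safe. */
    memset(shift, m - 1 < 255 ? (int)(m - 1) : 255, sizeof shift);
    for (i = 0; i + 2 < m; i++) {        /* bigrams ending before the last byte */
        pair = (unsigned char)needle[i] << 8 | (unsigned char)needle[i + 1];
        shift[pair] = (unsigned char)(m - 2 - i);
    }

    for (i = 0; i + m <= n; ) {
        if (memcmp(haystack + i, needle, m) == 0)
            return (char *)(haystack + i);
        pair = (unsigned char)haystack[i + m - 2] << 8
             | (unsigned char)haystack[i + m - 1];
        i += shift[pair] ? shift[pair] : 1;   /* always advance at least 1 */
    }
    return NULL;
}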
Answered by Conrad Meyer
Use stdlib strstr:
char *foundit = strstr(haystack, needle);
It was very fast, only took me about 5 seconds to type.
Answered by Matt Joiner
Here's Python's search implementation, used throughout the core. The comments indicate it uses a compressed Boyer-Moore delta-1 table.
I have done some pretty extensive experimentation with string searching myself, but it was for multiple search strings. Assembly implementations of Horspool and Bitap can often hold their own against algorithms like Aho-Corasick for low pattern counts.
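Since Bitap is the less familiar of the two, here is a compact single-pattern shift-or sketch for needles of at most 63 bytes (illustrative only; the multi-pattern assembly versions mentioned above extend the same state machine):

#include <stdint.h>
#include <string.h>

/* Bitap (shift-or): bit j of 'state' is 0 iff needle[0..j] matches the
 * haystack text ending at the current position; two word ops per byte. */
char *bitap_search(const char *haystack, const char *needle)
{
    size_t m = strlen(needle), i;
    uint64_t mask[256], state = ~0ull;

    if (m == 0) return (char *)haystack;
    if (m > 63) return NULL;              /* out of scope for this sketch */

    for (i = 0; i < 256; i++) mask[i] = ~0ull;
    for (i = 0; i < m; i++)               /* 0 bit = this byte matches position i */
        mask[(unsigned char)needle[i]] &= ~(1ull << i);

    for (i = 0; haystack[i]; i++) {
        state = (state << 1) | mask[(unsigned char)haystack[i]];
        if (!(state & (1ull << (m - 1))))            /* full match ends at i */
            return (char *)(haystack + i - m + 1);
    }
    return NULL;
}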
Answered by johne
A faster "Search for a single matching character" (ala strchr) algorithm.
Important notes:
- These functions use a "number / count of (leading|trailing) zeros" gcc compiler intrinsic, __builtin_ctz. These functions are likely to only be fast on machines that have an instruction(s) that perform this operation (i.e., x86, ppc, arm).
- These functions assume the target architecture can perform 32- and 64-bit unaligned loads. If your target architecture does not support this, you will need to add some start-up logic to properly align the reads.
- These functions are processor neutral. If the target CPU has vector instructions, you might be able to do (much) better. For example, the strlen function below uses SSE3 and can be trivially modified to XOR the bytes scanned to look for a byte other than 0. Benchmarks performed on a 2.66 GHz Core 2 laptop running Mac OS X 10.6 (x86_64):
  - 843.433 MB/s for strchr
  - 2656.742 MB/s for findFirstByte64
  - 13094.479 MB/s for strlen
... a 32-bit version:
#include <stdint.h>   /* uint32_t */

/* Returns the 1-based byte index of the first zero byte in x, or 0 if none.
 * The 0x7F7F7F7F trick sets the high bit of each byte that was zero; the
 * _x == 0 guard is needed because __builtin_clz(0)/__builtin_ctz(0) are undefined. */
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u) ? 0 : (__builtin_clz(_x) >> 3) + 1; })
#else
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u) ? 0 : (__builtin_ctz(_x) + 1) >> 3; })
#endif

unsigned char *findFirstByte32(unsigned char *ptr, unsigned char byte) {
    uint32_t *ptr32 = (uint32_t *)ptr, firstByte32 = 0u, byteMask32 = (byte) | (byte << 8);
    byteMask32 |= byteMask32 << 16;   /* replicate the target byte into all four lanes */
    /* XOR turns bytes equal to 'byte' into zero bytes, which the macro then locates. */
    while((firstByte32 = findFirstZeroByte32((*ptr32) ^ byteMask32)) == 0) { ptr32++; }
    return(ptr + ((((unsigned char *)ptr32) - ptr) + firstByte32 - 1));
}
... and a 64-bit version:
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7F7F7F7Full) + 0x7F7F7F7F7F7F7F7Full) | _x | 0x7F7F7F7F7F7F7F7Full); (_x == 0ull) ? 0 : (__builtin_clzll(_x) >> 3) + 1; })
#else
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7F7F7F7Full) + 0x7F7F7F7F7F7F7F7Full) | _x | 0x7F7F7F7F7F7F7F7Full); (_x == 0ull) ? 0 : (__builtin_ctzll(_x) + 1) >> 3; })
#endif

unsigned char *findFirstByte64(unsigned char *ptr, unsigned char byte) {
    uint64_t *ptr64 = (uint64_t *)ptr, firstByte64 = 0u, byteMask64 = (byte) | (byte << 8);
    byteMask64 |= byteMask64 << 16;   /* replicate the target byte into all eight lanes */
    byteMask64 |= byteMask64 << 32;
    while((firstByte64 = findFirstZeroByte64((*ptr64) ^ byteMask64)) == 0) { ptr64++; }
    return(ptr + ((((unsigned char *)ptr64) - ptr) + firstByte64 - 1));
}
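A quick usage sketch of my own (not from the answer), with the buffer deliberately padded so the word-sized reads past the match stay inside the array; this anticipates the over-read caveat discussed next:

#include <stdio.h>

/* Assumes findFirstByte64 from above is in scope. */
int main(void) {
    /* 64-byte buffer: word-sized reads past the match stay inside it */
    unsigned char buf[64] = "find the first 'x' in this string";
    unsigned char *p = findFirstByte64(buf, 'x');
    printf("found at offset %ld\n", (long)(p - buf)); /* offset 16 */
    return 0;
}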
Edit 2011/06/04: The OP points out in the comments that this solution has an "insurmountable bug":
it can read past the sought byte or null terminator, which could access an unmapped page or page without read permission. You simply cannot use large reads in string functions unless they're aligned.
This is technically true, but applies to virtually any algorithm that operates on chunks that are larger than a single byte, including the method suggested by the OP in the comments:
A typical strchr implementation is not naive, but quite a bit more efficient than what you gave. See the end of this for the most widely used algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
It also really has nothing to do with alignment per se. True, this could potentially cause the behavior discussed on the majority of common architectures in use, but this has more to do with microarchitecture implementation details: if the unaligned read straddles a 4K boundary (again, typical), then that read will cause a program-terminating fault if the next 4K page boundary is unmapped.
But this isn't a "bug" in the algorithm given in the answer: that behavior is because functions like strchr and strlen do not accept a length argument to bound the size of the search. Searching char bytes[1] = {0x55};, which for the purposes of our discussion just so happens to be placed at the very end of a 4K VM page boundary with the next page unmapped, with strchr(bytes, 0xAA) (where strchr is a byte-at-a-time implementation) will crash exactly the same way. Ditto for strchr's related cousin strlen.
Without a length argument, there is no way to tell when you should switch out of the high-speed algorithm and back to a byte-by-byte algorithm. A much more likely "bug" would be to read "past the size of the allocation", which technically results in undefined behavior according to the various C language standards, and would be flagged as an error by something like valgrind.
In summary, anything that operates on larger-than-byte chunks to go faster, as this answer's code does and as the code pointed out by the OP does, but must have byte-accurate read semantics, is likely to be "buggy" if there is no length argument to control the corner case(s) of "the last read".
The code in this answer is a kernel for being able to find the first byte in a natural CPU word-size chunk quickly if the target CPU has a fast ctz-like instruction. It is trivial to add things like making sure it only operates on correctly aligned natural boundaries, or some form of length bound, which would allow you to switch out of the high-speed kernel and into a slower byte-by-byte check.
The OP also states in the comments:
As for your ctz optimization, it only makes a difference for the O(1) tail operation. It could improve performance with tiny strings (e.g. strchr("abc", 'a')) but certainly not with strings of any major size.
Whether or not this statement is true depends a great deal on the microarchitecture in question. Using the canonical 4-stage RISC pipeline model, it is almost certainly true. But it is extremely hard to tell if it is true for a contemporary out-of-order superscalar CPU, where the core speed can utterly dwarf the memory streaming speed. In this case, it is not only plausible but quite common for there to be a large gap between "the number of instructions that can be retired" and "the number of bytes that can be streamed", so that you have "the number of instructions that can be retired for each byte that can be streamed". If this is large enough, the ctz + shift instructions can be done "for free".

