Faster than binary search for ordered list

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow

Original: http://stackoverflow.com/questions/4057258/
Asked by uray
is there an algorithm that is faster than binary search, for searching in sorted values of an array?

in my case, I have sorted values (could be values of any type) in an array A, and I need to return n if the value I am looking for is in the range of A[n] and A[n+1]
Answered by jonderry
You can do better than O(log n) if the values are integers, in which case the best worst-case running time you can achieve, in terms of n, is O(sqrt(log n)). Otherwise, there is no way to beat O(log n) unless there are patterns in the input sequence. There are two approaches used to beat O(log n) in the case of integers.
First, you can use y-fast trees, which work by storing in a hash table all prefixes for which you are storing at least one integer with that prefix. This enables you to perform a binary search to find the length of the longest matching prefix, and thus to find the successor of an element for which you are searching in time O(log w), where w is the number of bits in a word. There are some details to work through to make this work and use only linear space, but they aren't too bad (see the link below).
Second, you can use fusion trees, which use bit tricks to enable you to perform w^O(1) comparisons in just a constant number of instructions, yielding a running time of O(log n / log w).
The optimum tradeoff between these two data structures occurs when log w = sqrt(log n), giving a running time of O(sqrt(log n)).
For details on the above, see lectures 12 and 13 of Erik Demaine's course: http://courses.csail.mit.edu/6.851/spring07/lec.html
Answered by xscott
One possibility is to treat it like finding the roots of a function. Basically, finding:

a[i] <= x <= a[i + 1]

is equivalent to:

a[i] - x <= 0 <= a[i + 1] - x

Then you could try something like Newton's method and so on. These kinds of algorithms frequently converge faster than a binary search when they work, but I don't know of one that is guaranteed to converge for all input.
Answered by Ignacio Vazquez-Abrams
If the values in the list are evenly distributed then you could try a weighted split instead of a binary split, e.g. if the desired value is a third of the way from the current lower limit to the current value then you could try the element that is also a third of the way. This could suffer badly on lists where values are bunched up though.
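A minimal sketch of this weighted-split idea is interpolation search. The version below is an illustration, not code from the answer: it assumes int keys and an invented function name, probing the position where the key would sit if the values were evenly spread between A[lo] and A[hi].

```c
/* Interpolation search: instead of probing the midpoint, probe where the key
 * "should" be if the values were evenly distributed between A[lo] and A[hi].
 * Returns the index of key in A (size elements), or -1 if it is not present. */
int interpolation_search(const int A[], int size, int key)
{
    int lo = 0, hi = size - 1;
    while (lo <= hi && key >= A[lo] && key <= A[hi]) {
        long long offset;
        int probe;
        if (A[hi] == A[lo])  /* flat range: avoid division by zero */
            return (A[lo] == key) ? lo : -1;
        /* weighted split: offset proportional to key's position in [A[lo], A[hi]] */
        offset = (long long)(key - A[lo]) * (hi - lo) / (A[hi] - A[lo]);
        probe = lo + (int)offset;
        if (A[probe] == key)
            return probe;
        else if (A[probe] < key)
            lo = probe + 1;
        else
            hi = probe - 1;
    }
    return -1;
}
```

On roughly uniform data this kind of probe converges in far fewer steps than bisection on average, but as the answer notes, it degrades badly when the values are bunched up.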
Answered by Ben Voigt
Yes and no. Yes there are searches that are faster, on average, than a bisection search. But I believe that they are still O(lg N), just with a lower constant.
You want to minimize the time taken to find your element. Generally it is desirable to use fewer steps, and one way to approach this is to maximize the expected number of elements that will be eliminated at each step. With bisection, always exactly half the elements are eliminated. You can do better than this, IF you know something about the distribution of the elements. But, the algorithm for choosing the partition element is generally more complicated than choosing the midpoint, and this extra complexity may overwhelm any time savings you expected to get from using fewer steps.
Really, in a problem like this it's better to attack second-order effects like cache locality, than the search algorithm. For example, when doing a repeated binary search, the same few elements (first, second, and third quartiles) are used VERY frequently, so putting them in a single cache line could be far superior to random access into the list.
Dividing each level into say 4 or 8 equal sections (instead of 2) and doing a linear search through those could also be quicker than the bisection search, because a linear search doesn't require calculating the partition and also has fewer data dependencies that can cause cache stalls.
But all of these are still O(lg N).
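One hedged sketch of that multi-way idea, assuming int keys (four_way_search is an invented name, not code from the answer): each level probes three interior partition points with a linear scan and keeps roughly a quarter of the range, returning the largest index i with A[i] <= key, or -1 if the key is below every element.

```c
/* Sketch of a 4-way search over a sorted int array (half-open range [lo, hi)).
 * Each level scans three partition points linearly and keeps roughly one
 * quarter of the range. Returns the largest index i with A[i] <= key,
 * or -1 if key < A[0]. */
int four_way_search(const int A[], int n, int key)
{
    int lo = 0, hi = n;
    while (hi - lo > 4) {
        int step = (hi - lo) / 4;
        int p;
        /* probe the interior partition points in order, left to right */
        for (p = lo + step; p < hi; p += step) {
            if (A[p] > key)
                break;     /* key lies before this partition */
            lo = p;        /* key is at or after this partition */
        }
        if (p < hi)
            hi = p;
    }
    /* finish the small remaining range with a linear scan */
    while (lo + 1 < hi && A[lo + 1] <= key)
        lo++;
    return (lo < hi && A[lo] <= key) ? lo : -1;
}
```

Whether this actually beats plain bisection depends on data size and cache behavior, so it is worth benchmarking rather than assuming.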
Answered by user2747438
What about the following algorithm? It is called Exponential Search and is one of the variations of binary search. http://en.m.wikipedia.org/wiki/Exponential_search
Searching for element k in a sorted array A of size n: look up A[2^i] for i = 0, 1, 2, ... until you go beyond k's position in A, then do a binary search on the part of the array to the left of (smaller than) that position.
// `size` is the number of elements in A; without it the doubling probe
// could read past the end of the array.
int exponential_search(int A[], int size, int key)
{
    // lower and upper bound for binary search
    int lower_bound = 0;
    int upper_bound = 1;
    // calculate lower and upper bound by repeated doubling
    while (upper_bound < size && A[upper_bound] < key) {
        lower_bound = upper_bound;
        upper_bound = upper_bound * 2;
    }
    // clamp to the last valid index
    if (upper_bound >= size)
        upper_bound = size - 1;
    return binary_search(A, key, lower_bound, upper_bound);
}
This algorithm will run in O(log idx), where idx is the index of k in A (both steps are O(log idx)). In the worst case, when k is amongst the largest elements of A or bigger than any element of A, this is O(log n). The multiplicative constant is larger than for binary search, but the algorithm runs faster for very large arrays and when looking for data that is towards the beginning of the array.
I'd like to have some idea of the minimal size n at which this algorithm becomes preferable to binary search, but I don't know.
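The binary_search helper that the snippet calls is not shown in the answer; a minimal version, assuming it takes an inclusive index range [lower_bound, upper_bound] and returns the key's index (or -1 when the key is absent), might look like:

```c
/* Minimal binary_search for the snippet above, assuming an inclusive index
 * range [lower_bound, upper_bound]. Returns the index of key, or -1 if key
 * is not in that range. */
int binary_search(int A[], int key, int lower_bound, int upper_bound)
{
    while (lower_bound <= upper_bound) {
        /* overflow-safe midpoint */
        int mid = lower_bound + (upper_bound - lower_bound) / 2;
        if (A[mid] == key)
            return mid;
        else if (A[mid] < key)
            lower_bound = mid + 1;
        else
            upper_bound = mid - 1;
    }
    return -1;
}
```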
Answered by Fabio
Although in the general case you cannot do better than O(log N), you can at least optimize that, thus significantly reducing the constant of proportionality in front of O(log N).
If you have to perform multiple searches on the same array, these can be vectorized using SIMD extensions, thus further cutting down on computation cost.
In particular, if you are dealing with arrays of floating point numbers which satisfy certain properties, then there are ways to construct a special index which allows searching the array in O(1).
All of the above aspects are discussed with test results in: Cannizzo, 2015, Fast and Vectorizable Alternative to Binary Search in O(1) Applicable to a Wide Domain of Sorted Arrays of Floating Point Numbers. The paper comes with source code on GitHub.
Answered by Cheers and hth. - Alf
First of all, measure before doing optimization.
Do you really need to optimize that search?
If so, then secondly, think about algorithmic complexity first. E.g. can you use a tree (like a std::map, say) instead of an array? If so, then it depends on the relative frequency of insertions/deletions versus searches, but the premise of having a sorted array at hand indicates that searches are frequent compared to data set changes, so it would make sense to do some little additional work for insertions/deletions, making each search much faster -- namely logarithmic time.
If you find that indeed the search times are a bottleneck that needs addressing, and no, no change of data representation is possible, and the list is short, then a linear search will generally be faster because it does less work per comparison.
Otherwise, if the list is longer, and no particular distribution of values is known or assumed, and the values can't be treated as numerical, and memory consumption should be constant (ruling out constructing a hash table, say), then binary search produces 1 bit of information per comparison and is probably the best you can do for the first search.
Cheers & hth.
Answered by srean
You can always put them in a hash table, then search will be O(1). It will be memory intensive though and if you keep adding items, the hash table might need to be re-bucketed. Re-bucketing is O(n) but it will get amortized to O(1). It essentially depends on whether you can afford that space and the potential cache misses.
Answered by bjoernz
In binary search you split the list into two "sublists" and you only search the sublist that may contain the value. Depending on how large your array is, you could see a speedup if you split the array into more than two slices.

You can determine which region of the array you have to search by keeping an index that you search first. Like in a telephone book of a large city, where you can see from the outside where you have to start to search. (I have trouble expressing my idea in text, and I did not find an English link yet that explains it better.)
Answered by David
If you have a huge amount of numbers to find, and by some fluke they are ALSO sorted, you could do it in O(n + m), where m is the number of numbers to find. Basically just your typical merge algorithm, with a slight modification to record, for each checked number, the value it would be inserted before if it were to be inserted into the array.
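A sketch of that merge pass, assuming both the array and the numbers to find are sorted ascending (batch_positions is an invented name): each output records the index at which the query would be inserted, i.e. how many array elements are strictly smaller.

```c
/* Merge-style batch lookup: A (size n) and the queries q (size m) are both
 * sorted ascending. For each q[j], out[j] receives the index at which q[j]
 * would be inserted into A -- the count of elements of A strictly less than
 * q[j]. A single forward pass over both arrays: O(n + m). */
void batch_positions(const int A[], int n, const int q[], int m, int out[])
{
    int i = 0;
    for (int j = 0; j < m; j++) {
        while (i < n && A[i] < q[j])  /* advance the merge pointer */
            i++;
        out[j] = i;                   /* insertion point for q[j] */
    }
}
```

Because both pointers only move forward, the whole batch costs O(n + m) rather than O(m log n), which wins when m is comparable to n.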
You can always trade off space... And time of other operations. Assuming all your elements are constant size p bits, you can make a massive array which stores, for each possible value you could look up, the index of the next bigger value currently stored. This array needs to be 2^p*lg(n) bits, where n is the number values stored. Each insertion or deletion is O(2^p) but typically around 2^p/n, because you have to go through updating all those indices.
But your lookup is now O(1)!
OK, OK, it's not really practical. But dividing the input into blocks in a similar fashion could possibly reduce the constant in front of your log. Possibly.