Fastest code C/C++ to select the median in a set of 27 floating point values
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, published under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same CC BY-SA terms, link to the original question, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/810657/
Asked by chmike
This is the well-known selection algorithm; see http://en.wikipedia.org/wiki/Selection_algorithm.
I need it to find the median value of a set of 3x3x3 voxel values. Since the volume is made of a billion voxels and the algorithm is recursive, it better be a little bit fast. In general it can be expected that values are relatively close.
The fastest known algorithm I have tried out so far uses the quick sort partition function. I would like to know if there is a faster one.
I've "invented" a 20% faster one using two heaps, but expected an even faster one using a hash. Before implementing this, I'd like to know if a blitz-fast solution already exists out there.
The fact that I'm using floats shouldn't matter, since they can be considered as unsigned integers after inverting the sign bit. The order will be preserved.
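This sign-bit mapping can be sketched as follows (my illustration; the function name floatKey is not from the question, and NaNs are assumed absent):

```cpp
#include <cstdint>
#include <cstring>

// Map a float's IEEE-754 bit pattern to a uint32_t whose unsigned
// ordering matches the float ordering (valid for non-NaN values):
// negatives get all bits flipped, non-negatives only the sign bit.
uint32_t floatKey( float f )
{
    uint32_t u;
    std::memcpy( &u, &f, sizeof u );   // well-defined type pun
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```

Note that -0.0f and +0.0f map to adjacent but distinct keys, which is harmless for median selection.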
EDIT: benchmark and source code moved into a separate answer as suggested by Davy Landman. See below for the answer by chmike.
EDIT: The most efficient algorithm so far was referenced below by Boojum as a link to the Fast Median and Bilateral Filtering paper, which is now the answer to this question. The first smart idea of this method is to use radix sort; the second is to merge the median searches of adjacent pixels whose windows share most of their pixels.
Accepted answer by Boojum
Since it sounds like you're performing a median filter on a large array of volume data, you might want to take a look at the Fast Median and Bilateral Filtering paper from SIGGRAPH 2006. That paper deals with 2D image processing, but you might be able to adapt the algorithm for 3D volumes. If nothing else, it might give you some ideas on how to step back and look at the problem from a slightly different perspective.
Answered by newacct
The selection algorithm is linear time (O(n)). Complexity-wise you can't do better than linear time, because it takes linear time to read in all the data. So you couldn't have made something that is faster complexity-wise. Perhaps you have something that is a constant factor faster on certain inputs? I doubt it would make much of a difference.
C++ already includes the linear-time selection algorithm. Why not just use it?
std::vector<YourType>::iterator first = yourContainer.begin();
std::vector<YourType>::iterator last = yourContainer.end();
std::vector<YourType>::iterator middle = first + (last - first) / 2;
std::nth_element(first, middle, last); // can specify comparator as optional 4th arg
YourType median = *middle;
Edit: Technically, that is only the median for a container of odd length. For one of even length, it will get the "upper" median. If you want the traditional definition of median for even length, you might have to run it twice, once for each of the two "middles" at first + (last - first) / 2 and first + (last - first) / 2 - 1, and then average them or something.
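The even-length case can be sketched like this (my illustration, not from the answer; after the first nth_element call, the "lower" middle is simply the maximum of the left partition, so no second nth_element pass is needed):

```cpp
#include <algorithm>
#include <vector>

// Median of an even-length container: nth_element places the upper
// middle; the lower middle is the largest value left of it.
double medianEven( std::vector<float> v )
{
    std::vector<float>::iterator mid = v.begin() + v.size() / 2;
    std::nth_element( v.begin(), mid, v.end() );       // upper middle at *mid
    float hi = *mid;
    float lo = *std::max_element( v.begin(), mid );    // lower middle
    return (lo + hi) / 2.0;
}
```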
Answered by chmike
EDIT: I have to apologize. The code below was WRONG. I have the fixed code, but need to find an icc compiler to redo the measurements.
The benchmark results of the algorithms considered so far
For the protocol and short description of algorithms see below. First value is mean time (seconds) over 200 different sequences and second value is stdDev.
HeapSort : 2.287 0.2097
QuickSort : 2.297 0.2713
QuickMedian1 : 0.967 0.3487
HeapMedian1 : 0.858 0.0908
NthElement : 0.616 0.1866
QuickMedian2 : 1.178 0.4067
HeapMedian2 : 0.597 0.1050
HeapMedian3 : 0.015 0.0049 <-- best
Protocol: generate 27 random floats using random bits obtained from rand(). Apply each algorithm 5 million times in a row (including prior array copy) and compute average and stdDev over 200 random sequences. C++ code compiled with icc -S -O3 and run on Intel E8400 with 8GB DDR3.
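A minimal harness in the spirit of this protocol might look like the following (my sketch; medianFn stands for whichever algorithm is under test, and the per-repetition memcpy reproduces the "prior array copy" included in the timings):

```cpp
#include <chrono>
#include <cstring>

// Time `reps` median computations over the same 27-value input,
// including the per-call array copy, and return elapsed seconds.
template <typename F>
double timeMedian( F medianFn, const float src[27], long reps )
{
    float work[27];
    volatile float sink = 0;   // keeps the calls from being optimized out
    auto t0 = std::chrono::steady_clock::now();
    for( long r = 0; r < reps; ++r )
    {
        std::memcpy( work, src, sizeof work );
        sink = medianFn( work );
    }
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>( t1 - t0 ).count();
}
```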
Algorithms:
HeapSort : full sort of sequence using heap sort and pick middle value. Naive implementation using subscript access.
QuickSort: full in place sort of sequence using quick sort and pick middle value. Naive implementation using subscript access.
QuickMedian1: quick select algorithm with swapping. Naive implementation using subscript access.
HeapMedian1: in place balanced heap method with prior swapping. Naive implementation using subscript access.
NthElement : uses the nth_element STL algorithm. Data is copied into the vector using memcpy( vct.data(), rndVal, ... );
QuickMedian2: uses quick select algorithm with pointers and copy in two buffers to avoid swapping. Based on the proposal of MSalters.
HeapMedian2 : variant of my invented algorithm using dual heaps with shared heads. The left heap has its biggest value as head, the right its smallest value as head. Initialize with the first value as common head and first median value guess. Add subsequent values to the left heap if smaller than the head, otherwise to the right heap, until one of the heaps is full. A heap is full when it contains 14 values. Then consider only the full heap. If it's the right heap: for all values bigger than the head, pop the head and insert the value; ignore all other values. If it's the left heap: for all values smaller than the head, pop the head and insert the value; ignore all other values. When all values have been processed, the common head is the median value. It uses integer indexes into the array. The version using pointers (64bit) appeared to be nearly twice as slow (~1s).
HeapMedian3 : same algorithm as HeapMedian2 but optimized. It uses an unsigned char index, avoids value swapping, and various other little things. The mean and stdDev values are computed over 1000 random sequences. For nth_element I measured 0.508s and a stdDev of 0.159537 with the same 1000 random sequences. HeapMedian3 is thus 33 times faster than the nth_element stl function. Each returned median value is checked against the median value returned by heapSort and they all match. I doubt a method using a hash would be significantly faster.
EDIT 1: This algorithm can be further optimized. The first phase, where elements are dispatched into the left or right heap based on the comparison result, doesn't need heaps. It is sufficient to simply append elements to two unordered sequences. Phase one stops as soon as one sequence is full, which means it contains 14 elements (including the median value). The second phase starts by heapifying the full sequence and then proceeds as described in the HeapMedian3 algorithm. I'll provide the new code and benchmark as soon as possible.
EDIT 2: I implemented and benchmarked the optimized algorithm. But there is no significant performance difference compared to HeapMedian3; it is even slightly slower on average. The shown results are confirmed. There might be a difference with much larger sets. Note also that I simply pick the first value as the initial median guess. As suggested, one could benefit from the fact that we search for a median value in "overlapping" value sets. Using the median of medians algorithm would help to pick a much better initial median value guess.
Source code of HeapMedian3
// return the median value in a vector of 27 floats pointed to by a
float heapMedian3( float *a )
{
    float left[14], right[14], median, *p;
    unsigned char nLeft, nRight;

    // pick first value as median candidate
    p = a;
    median = *p++;
    nLeft = nRight = 1;

    for(;;)
    {
        // get next value
        float val = *p++;

        // if value is smaller than median, append to left heap
        if( val < median )
        {
            // move biggest value to the heap top
            unsigned char child = nLeft++, parent = (child - 1) / 2;
            while( parent && val > left[parent] )
            {
                left[child] = left[parent];
                child = parent;
                parent = (parent - 1) / 2;
            }
            left[child] = val;

            // if left heap is full
            if( nLeft == 14 )
            {
                // for each remaining value
                for( unsigned char nVal = 27 - (p - a); nVal; --nVal )
                {
                    // get next value
                    val = *p++;

                    // if value is to be inserted in the left heap
                    if( val < median )
                    {
                        child = left[2] > left[1] ? 2 : 1;
                        if( val >= left[child] )
                            median = val;
                        else
                        {
                            median = left[child];
                            parent = child;
                            child = parent*2 + 1;
                            while( child < 14 )
                            {
                                if( child < 13 && left[child+1] > left[child] )
                                    ++child;
                                if( val >= left[child] )
                                    break;
                                left[parent] = left[child];
                                parent = child;
                                child = parent*2 + 1;
                            }
                            left[parent] = val;
                        }
                    }
                }
                return median;
            }
        }
        // else append to right heap
        else
        {
            // move smallest value to the heap top
            unsigned char child = nRight++, parent = (child - 1) / 2;
            while( parent && val < right[parent] )
            {
                right[child] = right[parent];
                child = parent;
                parent = (parent - 1) / 2;
            }
            right[child] = val;

            // if right heap is full
            if( nRight == 14 )
            {
                // for each remaining value
                for( unsigned char nVal = 27 - (p - a); nVal; --nVal )
                {
                    // get next value
                    val = *p++;

                    // if value is to be inserted in the right heap
                    if( val > median )
                    {
                        child = right[2] < right[1] ? 2 : 1;
                        if( val <= right[child] )
                            median = val;
                        else
                        {
                            median = right[child];
                            parent = child;
                            child = parent*2 + 1;
                            while( child < 14 )
                            {
                                if( child < 13 && right[child+1] < right[child] )
                                    ++child;
                                if( val <= right[child] )
                                    break;
                                right[parent] = right[child];
                                parent = child;
                                child = parent*2 + 1;
                            }
                            right[parent] = val;
                        }
                    }
                }
                return median;
            }
        }
    }
}
Answered by stephan
The question cannot easily be answered, for the simple reason that the performance of one algorithm relative to another depends as much on the compiler / processor / data structure combination as on the algorithm itself, as you surely know.
Therefore your approach of trying a couple of them seems good enough. And yes, quicksort should be pretty fast. If you haven't done so, you might want to try insertion sort, which often performs better on small data sets. This said, just settle on a sorting algo that does the job fast enough. You will typically not get 10 times faster just by picking the "right" algo.
To get substantial speed-ups, the better way is frequently to use more structure. Some ideas that worked for me in the past with large-scale problems:
Can you efficiently pre-calculate while creating the voxels and store 28 instead of 27 floats?
Is an approximate solution good enough? If so, just look at the median of, say 9 values, since "in general it can be expected that values are relatively close." Or you can replace it with the average as long as the values are relatively close.
Do you really need the median for all billions of voxels? Maybe you have an easy test for whether you need the median, and can then calculate it only for the relevant sub-set.
If nothing else helps: look at the asm code that the compiler generates. You might be able to write asm code that is substantially faster (e.g. by doing all the calcs using registers).
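The approximate-median idea from the list above could be sketched like this (my illustration; the cube[z][y][x] layout and the choice of the middle z-slice are assumptions):

```cpp
#include <algorithm>

// Approximate the 27-voxel median by the exact median of only the
// 9 values in the central 3x3 slice.
float approxMedian9( const float cube[3][3][3] )
{
    float s[9];
    for( int y = 0; y < 3; ++y )
        for( int x = 0; x < 3; ++x )
            s[3*y + x] = cube[1][y][x];   // middle z-slice only
    std::nth_element( s, s + 4, s + 9 );  // 5th smallest of 9
    return s[4];
}
```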
Edit: For what it's worth, I have attached the (partial) insertion sort code mentioned in the comment below (totally untested). If numbers[] is an array of size N, and you want the smallest P floats sorted at the beginning of the array, call partial_insertionsort<N, P, float>(numbers);. Hence if you call partial_insertionsort<27, 13, float>(numbers);, numbers[13] will contain the median. To gain additional speed, you would have to unroll the while loop, too. As discussed above, to get really fast, you have to use your knowledge about the data (e.g. is the data already partially sorted? Do you know properties of the distribution of the data? I guess you get the drift).
template <long i> class Tag{};

template<long i, long N, long P, typename T>
inline void partial_insertionsort_for(T a[], Tag<N>, Tag<i>)
{
    long j = i <= P+1 ? i : P+1;  // partial sort
    T temp = a[i];
    a[i] = a[j];                  // compiler should optimize this away where possible
    while( j > 0 && temp < a[j - 1] )  // test j first to avoid reading a[-1]
    {
        a[j] = a[j - 1];
        j--;
    }
    a[j] = temp;
    partial_insertionsort_for<i+1, N, P, T>(a, Tag<N>(), Tag<i+1>());
}

template<long i, long N, long P, typename T>
inline void partial_insertionsort_for(T a[], Tag<N>, Tag<N>){}

template <long N, long P, typename T>
inline void partial_insertionsort(T a[])
{
    partial_insertionsort_for<0, N, P, T>(a, Tag<N>(), Tag<0>());
}
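For reference, here is my sketch of the same idea as a plain run-time function rather than a recursive template (the function name and signature are my own):

```cpp
// Insertion-sort only the first p+1 slots of an n-element array so that
// a[0..p] end up holding the p+1 smallest values in ascending order;
// with n = 27 and p = 13, a[13] is the median.
void partialInsertionsort( float a[], int n, int p )
{
    for( int i = 1; i < n; ++i )
    {
        int j = i <= p + 1 ? i : p + 1;   // clamp the insertion window
        float temp = a[i];
        while( j > 0 && temp < a[j - 1] )
        {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = temp;   // values displaced past a[p] are simply discarded
    }
}
```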
Answered by MSalters
The most likely algorithm to use in your first attempt is just nth_element; it pretty much gives you what you want directly. Just ask for the 14th element.
On your second attempt, the goal is to take advantage of the fixed data size. You do not want to allocate any memory at all during your algorithm. So, copy your voxel values to a pre-allocated array of 27 elements. Pick a pivot, and copy it to the middle of a 53 element array. Copy the remaining values to either side of the pivot. Here you keep two pointers (float* left = base+25, *right = base+27). There are now three possibilities: the left side is larger, the right side is larger, or both have 12 elements. The last case is trivial; your pivot is the median. Otherwise, call nth_element on either the left side or the right side. The exact value of Nth depends on how many values were larger or smaller than the pivot. For instance, if the division is 12/14, you need the smallest element bigger than the pivot, so Nth=0, and if the division was 14/12, you need the biggest element smaller than the pivot, so Nth=13. The worst cases are 26/0 and 0/26, when your pivot was an extreme, but those happen only in 2/27th of all cases.
The third improvement (or the first, if you have to use C and do not have nth_element) replaces nth_element entirely. You still have the 53 element array, but this time you fill it directly from the voxel values (saving you an interim copy into a float[27]). The pivot in this first iteration is just voxel[0][0][0]. For subsequent iterations, you use a second pre-allocated float[53] (easier if both are the same size) and copy floats between the two. The basic iteration step here is still: copy the pivot to the middle, sort the rest to the left and the right. At the end of each step, you'll know whether the median is smaller or larger than the current pivot, so you can discard the floats bigger or smaller than that pivot. Per iteration, this eliminates between 1 and 12 elements, with an average of 25% of the remaining.
The final iteration, if you still need more speed, is based on the observation that most of your voxels overlap significantly. You pre-calculate for every 3x3x1 slice the median value. Then, when you need an initial pivot for your 3x3x3 voxel cube, you take the median of the three. You know a priori that there are 9 voxels smaller and 9 voxels larger than that median of medians (4+4+1). So, after the first pivoting step, the worst cases are a 9/17 and a 17/9 split. So, you'd only need to find the 4th or 13th element in a float[17], instead of the 12th or 14th in a float[26].
Background: The idea of copying first a pivot and then the rest of a float[N] to a float[2N-1], using left and right pointers, is that you fill a float[N] subarray around the pivot, with all elements smaller than the pivot to the left (lower index) and higher to the right (higher index). Now, if you want the Mth element, you might find yourself lucky and have M-1 elements smaller than the pivot, in which case the pivot is the element you need. If there are more than (M-1) elements smaller than the pivot, the Mth element is amongst them, so you can discard the pivot and anything bigger than the pivot, and search for the Mth element in all the lower values. If there are fewer than (M-1) elements smaller than the pivot, you're looking for a value higher than the pivot. So, you'll discard the pivot and anything smaller than it. Let the number of elements less than the pivot, i.e. to the left of the pivot, be L. In the next iteration, you want the (M-L-1)th element of the (N-L-1) floats that are bigger than the pivot.
This kind of nth_element algorithm is fairly efficient because most of the work is spent copying floats between two small arrays, both of which will be in cache, and because your state is most of the time represented by 3 pointers (source pointer, left destination pointer, right destination pointer).
To show the basic code:
float in[27], out[53];
float pivot = out[26] = in[0];             // pivot
float *left = out + 25, *right = out + 27;
for( int i = 1; i != 27; ++i )
    if( in[i] < pivot ) *left-- = in[i]; else *right++ = in[i];
// Post-condition: the range (left+1, right) is initialized.
// There are 25-(left-out) floats < pivot and (right-out)-27 floats > pivot
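A fuller, self-contained sketch of the selection loop this snippet belongs to (the wrapper name select27 and the copy-back between passes are my additions, not from the answer):

```cpp
#include <algorithm>

// Select the k-th smallest (0-based) of 27 values by repeated
// partitioning through a 53-slot buffer; the median is select27(a, 13).
float select27( const float in[27], int k )
{
    float w[27], buf[53];
    int n = 27;
    std::copy( in, in + 27, w );
    for(;;)
    {
        // partition around w[0], pivot placed at buf[26]
        float pivot = buf[26] = w[0];
        float *left = buf + 25, *right = buf + 27;
        for( int i = 1; i < n; ++i )
            if( w[i] < pivot ) *left-- = w[i]; else *right++ = w[i];

        int smaller = int( (buf + 25) - left );   // values strictly below pivot
        if( k == smaller )
            return pivot;
        if( k < smaller )                         // target on the smaller side
        {
            n = smaller;
            std::copy( left + 1, left + 1 + n, w );
        }
        else                                      // target on the larger side
        {
            n = int( right - (buf + 27) );
            std::copy( buf + 27, right, w );
            k -= smaller + 1;
        }
    }
}
```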
Answered by matja
A sorting network generated using the Bose-Nelson algorithm will find the median directly with no loops/recursion using 173 comparisons. If you have the facility to perform comparisons in parallel, such as via vector-arithmetic instructions, then you may be able to group the comparisons into as few as 28 parallel operations.
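For illustration, here is the compare-exchange primitive such a network is built from, shown on a tiny 3-input median network (my sketch; the real 27-input median network chains 173 such stages):

```cpp
#include <algorithm>

// Compare-exchange: after the call, a <= b.
inline void cswap( float &a, float &b )
{
    if( b < a ) std::swap( a, b );   // branchless variants use min/max
}

// 3-input median via three compare-exchange stages.
float median3( float a, float b, float c )
{
    cswap( a, b );   // a <= b
    cswap( b, c );   // c is now the maximum
    cswap( a, b );   // b is now the median
    return b;
}
```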
If you are sure that the floats are normalized and not (qs)NaN's, then you can use integer operations to compare IEEE-754 floats which can perform more favorably on some CPU's.
A direct conversion of this sorting network to C (gcc 4.2) yields a worst-case of 388 clock cycles on my Core i7.
Answered by matja
I suppose your best bet is to take an existing sorting algorithm and try to figure out whether you can adapt it so that the set does not need to be fully sorted. For determining the median, you need at most half the values sorted, either the lower or higher half would be enough:
original: | 5 | 1 | 9 | 3 | 3 |
sorted: | 1 | 3 | 3 | 5 | 9 |
lower half sorted: | 1 | 3 | 3 | 9 | 5 |
higher half sorted: | 3 | 1 | 3 | 5 | 9 |
The other half would be a bucket of unsorted values that merely share the property of being larger/smaller or equal to the largest/smallest sorted value.
But I have no ready algorithm for that, it's just an idea of how you might take a short-cut in your sorting.
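One way to realize this half-sorted idea (my sketch, not from the answer): selection-sort only the lower half and read off the middle element. The comparison count is quadratic in n, which is still acceptable for n = 27.

```cpp
#include <algorithm>

// Selection-sort positions 0..n/2 only; for odd n, a[n/2] is then
// the median. The upper half is left as an unsorted "bucket".
float medianByHalfSort( float a[], int n )
{
    for( int i = 0; i <= n / 2; ++i )
        std::swap( a[i], *std::min_element( a + i, a + n ) );
    return a[n / 2];
}
```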
Answered by Mark Ruzon
Alex Stepanov's new book Elements of Programming talks at some length about finding order statistics using the minimum number of average comparisons while minimizing runtime overhead. Unfortunately, a sizable amount of code is needed just to compute the median of 5 elements, and even then he gives as a project finding an alternate solution that uses a fraction of a comparison less on average, so I wouldn't dream of extending that framework to finding the median of 27 elements. And the book won't even be available until 15 June 2009. The point is that because this is a fixed-size problem, there is a direct comparison method that is provably optimal.
Also, there is the fact that this algorithm is not being run once in isolation but rather many times, and between most runs only 9 of the 27 values will change. That means in theory some of the work is done already. However, I have not heard of any median filtering algorithms in image processing that take advantage of this fact.
Answered by Shing Yip
+1 for everybody who mentioned nth_element, but this kind of code is where a hand-written algorithm is better than the STL, because you want to generate the most efficient code for that one compiler running on the one CPU with a specific data set. For example, for some CPU/compiler combination std::swap(int, int) may be slower than a hand-written swap using XOR (before you reply, I know this was probably true 20 years ago but not anymore). Sometimes performance is gained by hand-writing assembly code specific to your CPU. If you plan to take advantage of GPU's stream processors, you may have to design your algorithm accordingly.
You mentioned using 2 heaps and keeping track of the median as you insert. That's what I did a while ago in a project. I changed the array in place and used only one heap. I could not think of any faster algorithm, but I'd like to caution you about memory usage, specifically CPU cache memory. You want to be careful with memory access. CPU cache is swapped in and out by page, so you want your algorithm to touch memory locations that are close together to minimize CPU cache misses.
Answered by barre
Say you have a million different values from which you need the median. Is it possible to base your median on a subset of those million, let's say 10%, so that the median is close to the n-th element which divides the values into 2 equal (or almost equal) subsets? Then for finding the median you'd need less than O(n) time (in this case O(n/10)) and hereby come closer to optimal sorting with quicksort in O(n log n)?
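The sampling idea could be sketched like this (my illustration; the stride of 10 and the use of nth_element are my choices):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Estimate the median from every 10th value instead of the full set.
float approxMedianSample( const std::vector<float> &v )
{
    std::vector<float> sample;
    for( std::size_t i = 0; i < v.size(); i += 10 )
        sample.push_back( v[i] );
    std::vector<float>::iterator mid = sample.begin() + sample.size() / 2;
    std::nth_element( sample.begin(), mid, sample.end() );
    return *mid;
}
```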