Python: finding the Nth item of an unsorted list without sorting the list

Notice: This page is based on a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1034846/

Time: 2020-11-03 21:19:56  Source: igfitidea

Finding Nth item of unsorted list without sorting the list

Tags: python, arrays, sorting

Asked by ooboo

Hey. I have a very large array and I want to find the Nth largest value. Trivially, I can sort the array and then take the Nth element, but I'm only interested in one element, so there's probably a better way than sorting the entire array...

Accepted answer by Dario

A comparison sort needs O(n log n) time. There are very efficient selection algorithms which can solve your problem in linear time.

Partition-based selection (sometimes called quickselect), which is based on the idea of quicksort (recursive partitioning), is a good solution (the original answer links to pseudocode and to another example).

Answer by FogleBird

A heap is well suited to this operation, and Python has an excellent standard-library module for exactly this, called heapq.

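import heapq

def nth_largest(n, iterable):
    # nlargest returns the n largest items in descending order,
    # so the last one is the n-th largest. The parameter is named
    # "iterable" to avoid shadowing the built-in iter().
    return heapq.nlargest(n, iterable)[-1]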

Example usage:

>>> import random
>>> values = [random.randint(0, 1000) for i in range(100)]
>>> n = 10
>>> nth_largest(n, values)
920

Confirm the result by sorting (the exact value varies from run to run, since the data is random):

>>> sorted(values)[-n]
920

Answer by user183037

You could try the median of medians method; its running time is O(N), even in the worst case.
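
For readers who want to see the idea in code, here is a rough sketch of median of medians in Python. The function names `select_kth_smallest` and `nth_largest_mom` are made up for this illustration, and the version below copies sublists for readability rather than partitioning in place, so treat it as a teaching sketch and not a tuned implementation.

```python
def select_kth_smallest(values, k):
    """Return the k-th smallest item (k is 0-based) using median of medians."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Take the median of each group of (up to) five items ...
        medians = []
        for i in range(0, len(items), 5):
            group = sorted(items[i:i + 5])
            medians.append(group[len(group) // 2])
        # ... and use the median of those medians as the pivot.
        pivot = select_kth_smallest(medians, len(medians) // 2)
        lower = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(lower):
            items = lower
        elif k < len(lower) + equal_count:
            return pivot
        else:
            k -= len(lower) + equal_count
            items = [x for x in items if x > pivot]


def nth_largest_mom(values, n):
    """Return the n-th largest item (n is 1-based)."""
    items = list(values)
    # The n-th largest is the item at index len - n in ascending order.
    return select_kth_smallest(items, len(items) - n)
```

The pivot chosen this way is guaranteed to discard a constant fraction of the items on every round, which is where the worst-case linear bound comes from.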

Answer by Andrew Hare

You can iterate over the entire sequence, maintaining a list of the N largest values you find (for example, the 5 largest); for a small, fixed N this will be O(n). That being said, I think it would just be simpler to sort the list.

Answer by SPWorley

A simple modified quicksort works very well in practice. It has an average running time proportional to N (though the worst-case, bad-luck running time is O(N^2)).

Proceed like a quicksort. Pick a pivot value randomly, then stream through your values, see whether each is above or below that pivot value, and put them into two bins based on that comparison. In quicksort you'd then recursively sort each of those two bins. But for the N-th highest value computation, you only need to process ONE of the bins: the population of each bin tells you which bin holds your N-th highest value. So, for example, if you want the 125th highest value, and you sort into two bins which have 75 in the "high" bin and 150 in the "low" bin, you can ignore the high bin and just proceed to finding the 125-75=50th highest value in the low bin alone.
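
The two-bin procedure described above might look roughly like this in Python. The name `nth_highest` is made up for this sketch, and it adds one detail the description glosses over: values equal to the pivot are counted separately so that duplicates cannot cause an endless loop.

```python
import random


def nth_highest(values, n):
    """Return the n-th highest value (n is 1-based) by repeated partitioning."""
    items = list(values)
    if not 1 <= n <= len(items):
        raise ValueError("n must be between 1 and len(values)")
    while True:
        pivot = random.choice(items)
        high = [x for x in items if x > pivot]
        low = [x for x in items if x < pivot]
        ties = len(items) - len(high) - len(low)  # copies of the pivot value
        if n <= len(high):
            items = high               # the answer is in the "high" bin
        elif n <= len(high) + ties:
            return pivot               # the answer is the pivot value itself
        else:
            n -= len(high) + ties      # skip everything at or above the pivot
            items = low                # continue in the "low" bin only
```

For example, `nth_highest([5, 1, 9, 3, 7], 2)` returns `7`, the second-highest value.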

Answer by Jeff Meatball Yang

You essentially want to produce a "top-N" list and select the item at the end of that list.

So you can scan the array once and, whenever the largeArray item is greater than the last item of your top-N list, insert it into the top-N list and then drop the last item.

After you finish scanning, pick the last item in your top-N list.

An example for ints and N = 5:

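int[] top5 = new int[5];
java.util.Arrays.fill(top5, Integer.MIN_VALUE); // or your min value

for (int i = 0; i < largeArray.length; i++) {
    if (largeArray[i] > top5[4]) {
        // insert into top5, keeping it in descending order:
        // shift smaller entries down, which drops the old last item
        int j = 4;
        while (j > 0 && top5[j - 1] < largeArray[i]) {
            top5[j] = top5[j - 1];
            j--;
        }
        top5[j] = largeArray[i];
    }
}
// top5[4] now holds the 5th largest value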
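
Since the question is about Python, here is a rough Python equivalent of the same single-scan idea for an arbitrary N. The name `nth_largest_scan` is made up for this sketch. It keeps the top-N list in ascending order (using `bisect.insort` from the standard library), so the N-th largest value sits at the front of the list instead of at the end.

```python
import bisect


def nth_largest_scan(values, n):
    """Return the n-th largest item (n is 1-based) with one pass over values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    top = []  # ascending; top[0] is the smallest of the n largest seen so far
    for x in values:
        if len(top) < n:
            bisect.insort(top, x)      # still filling the top-N list
        elif x > top[0]:
            bisect.insort(top, x)      # insert in sorted position ...
            del top[0]                 # ... and drop the smallest entry
    if len(top) < n:
        raise ValueError("fewer than n items were supplied")
    return top[0]
```

Each insertion shifts up to N list entries, so this suits small N; for large N a heap is the usual choice.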

Answer by UncleO

Use heapsort. Building the heap only partially orders the list; the full ordering emerges as you draw the elements out, so you can stop after drawing out N of them.
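
A minimal sketch of that "stop early" heapsort idea, using the standard `heapq` module. The name `nth_largest_by_heap` is made up here, and the sketch assumes numeric values: `heapq` only provides a min-heap, so the values are negated to draw out the largest first.

```python
import heapq


def nth_largest_by_heap(values, n):
    """Return the n-th largest number (n is 1-based) by popping a heap n-1 times."""
    heap = [-x for x in values]        # negate: heapq is a min-heap
    if not 1 <= n <= len(heap):
        raise ValueError("n must be between 1 and len(values)")
    heapq.heapify(heap)                # linear-time partial ordering
    for _ in range(n - 1):
        heapq.heappop(heap)            # discard the n-1 largest values
    return -heap[0]                    # the next one is the n-th largest
```

Heapifying takes O(len) time and each pop takes O(log len), so the total is about O(len + n log len).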

Answer by Unknown

As people have said, you can walk the list once, keeping track of the K largest values. If K is large, this algorithm will be close to O(n^2).

However, you can store your K largest values in a balanced binary tree (or a heap), and the operation becomes O(n log k).
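
As a sketch of the O(n log k) variant, the version below swaps in a binary min-heap of size K (via `heapq`) in place of a general binary tree, since that is what the Python standard library offers. The name `kth_largest_bounded` is made up for this illustration.

```python
import heapq


def kth_largest_bounded(values, k):
    """Return the k-th largest item (k is 1-based) using a min-heap of size k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = []  # min-heap holding the k largest items seen so far
    for x in values:
        if len(heap) < k:
            heapq.heappush(heap, x)        # O(log k)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)     # pop the smallest, push x: O(log k)
    if len(heap) < k:
        raise ValueError("fewer than k items were supplied")
    return heap[0]  # smallest of the k largest, i.e. the k-th largest
```

Because the heap never grows beyond K entries, this also works on iterators and streams that do not fit in memory.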

According to Wikipedia, this is the best selection algorithm:

 function findFirstK(list, left, right, k)
     if right > left
         select pivotIndex between left and right
         pivotNewIndex := partition(list, left, right, pivotIndex)
         if pivotNewIndex > k  // new condition
             findFirstK(list, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             findFirstK(list, pivotNewIndex+1, right, k)

Its average-case complexity is O(n); with consistently unlucky pivot choices the worst case is O(n^2).
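
The pseudocode above can be turned into Python along these lines. This is only a sketch: `partition` here is a Lomuto-style partition written for this example, the pivot is picked at random, and the tail recursion of the pseudocode is replaced by a loop to stay clear of Python's recursion limit. The wrapper `nth_largest_inplace` is a made-up name; it works on a copy so the caller's list is left untouched.

```python
import random


def partition(items, left, right, pivot_index):
    """Move the pivot to its sorted position within items[left..right]."""
    pivot = items[pivot_index]
    items[pivot_index], items[right] = items[right], items[pivot_index]
    store = left
    for i in range(left, right):
        if items[i] < pivot:
            items[i], items[store] = items[store], items[i]
            store += 1
    items[store], items[right] = items[right], items[store]
    return store


def find_first_k(items, left, right, k):
    """Rearrange items so that items[k] is the value it would hold if sorted."""
    while right > left:
        pivot_index = random.randint(left, right)
        pivot_new_index = partition(items, left, right, pivot_index)
        if pivot_new_index > k:
            right = pivot_new_index - 1
        elif pivot_new_index < k:
            left = pivot_new_index + 1
        else:
            break


def nth_largest_inplace(values, n):
    """Return the n-th largest item (n is 1-based)."""
    items = list(values)  # copy, so the input is not reordered
    if not 1 <= n <= len(items):
        raise ValueError("n must be between 1 and len(values)")
    k = len(items) - n    # index of the n-th largest in ascending order
    find_first_k(items, 0, len(items) - 1, k)
    return items[k]
```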

Answer by John Nelson

One thing you should do, if this is in production code, is test with samples of your data. For example, you might consider arrays of 1,000 or 10,000 elements 'large', and code up a quickselect method from a recipe.

The compiled nature of sorted, and its somewhat hidden and constantly evolving optimizations, make it faster than a quickselect method written in pure Python on small to medium-sized datasets (< 1,000,000 elements). Also, you might find that as you increase the size of the array beyond that amount, memory is handled more efficiently in native code, and the benefit continues.

So, even if quickselect is O(n) versus sorted's O(n log n), that doesn't take into account how many actual machine-code instructions it takes to process each of the n elements, any impact on pipelining, the use of processor caches, and the other things the creators and maintainers of sorted bake into its implementation.
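
One way to run that kind of test is a small `timeit` harness like the sketch below. The helper names (`by_sorting`, `by_heap`, `compare`) and the default sizes are arbitrary choices for this illustration; swap in your own data and whichever selection function you are evaluating. No timings are shown here because they depend entirely on the machine, the Python version and the data.

```python
import heapq
import random
import timeit


def by_sorting(values, n):
    """n-th largest via a full sort."""
    return sorted(values)[-n]


def by_heap(values, n):
    """n-th largest via heapq.nlargest."""
    return heapq.nlargest(n, values)[-1]


def compare(size=10_000, n=10, repeats=5):
    """Return the best-of-`repeats` time in seconds for each approach."""
    data = [random.random() for _ in range(size)]
    # Sanity check: both approaches must agree before timing them.
    assert by_sorting(data, n) == by_heap(data, n)
    timings = {}
    for name, func in (("sorted", by_sorting), ("heapq.nlargest", by_heap)):
        runs = timeit.repeat(lambda: func(data, n), number=1, repeat=repeats)
        timings[name] = min(runs)
    return timings


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.6f} s")
```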