list 从链表中有效地选择一组随机元素

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/54059/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-11 01:17:37  来源:igfitidea点击:

Efficiently selecting a set of random elements from a linked list

algorithmlanguage-agnosticlist

提问by Matt Sheppard

Say I have a linked list of numbers of length N. Nis very large and I don't know in advance the exact value of N.

假设我有一个长度为数字的链表NN非常大,我事先不知道N.

How can I most efficiently write a function that will return kcompletely random numbersfrom the list?

我怎样才能最有效地编写一个从列表中返回k完全随机数的函数?

采纳答案by Richard

There's a very nice and efficient algorithm for this using a method called reservtheitroad sampling.

有一个非常好的和有效的算法,使用一种称为水库采样的方法。

Let me start by giving you its history:

让我先给你它的历史

Knuthcalls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.

Knuth在 p 上将此算法称为 R。他 1997 年版 Seminumerical Algorithms(计算机编程艺术第 2 卷)的第 144 条,并在那里提供了一些代码。Knuth 将该算法归功于 Alan G. Waterman。尽管进行了长时间的搜索,但我还是无法找到 Waterman 的原始文档(如果存在),这可能是您最常看到 Knuth 引用作为该算法来源的原因。

McLeod and Bellhouse, 1983(1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.

McLeod 和 Bellhouse,1983(1) 提供了比 Knuth 更彻底的讨论,以及第一个发布的证明(我知道)该算法有效。

Vitter 1985(2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.

Vitter 1985(2) 回顾了算法 R,然后提出了另外三种算法,它们提供相同的输出,但有所不同。他的算法不是选择包含还是跳过每个传入元素,而是预先确定要跳过的传入元素的数量。在他的测试中(不可否认,现在已经过时了)通过避免随机数的生成和对每个传入数字的比较,这大大减少了执行时间。

In pseudocodethe algorithm is:

伪代码中,算法是:

Let R be the result array of size s
Let I be an input queue

> Fill the reservtheitroad array
for j in the range [1,s]:
  R[j]=I.pop()

elements_seen=s
while I is not empty:
  elements_seen+=1
  j=random(1,elements_seen)       > This is inclusive
  if j<=s:
    R[j]=I.pop()
  else:
    I.pop()

Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it stillassures you that each element you encounter has an equal probability of ending up in R(that is, there is no bias). Furthermore, Rcontains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.

请注意,我专门编写了代码以避免指定输入的大小。这是这个算法的一个很酷的特性:你可以运行它而无需事先知道输入的大小,它仍然可以确保你遇到的每个元素都有相等的概率结束R(也就是说,没有偏差)。此外,R包含算法一直考虑的元素的公平和代表性样本。这意味着您可以将其用作在线算法

Why does this work?

为什么这样做?

McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.

McLeod 和 Bellhouse (1983) 提供了使用组合数学的证明。它很漂亮,但在这里重建它会有点困难。因此,我生成了一个更容易解释的替代证明。

We proceed via proof by induction.

我们通过归纳证明进行。

Say we want to generate a set of selements and that we have already seen n>selements.

假设我们想要生成一组s元素并且我们已经看到了n>s元素。

Let's assume that our current selements have already each been chosen with probability s/n.

让我们假设我们当前的s元素都已经被概率选择了s/n

By the definition of the algorithm, we choose element n+1with probability s/(n+1).

根据算法的定义,我们选择n+1概率为 的元素s/(n+1)

Each element already part of our result set has a probability 1/sof being replaced.

每个已经成为我们结果集一部分的元素都有1/s被替换的可能性。

The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).

因此,在n-seen 结果集中替换 -seen 结果集中的元素的概率n+1(1/s)*s/(n+1)=1/(n+1)。相反,元素未被替换的概率为1-1/(n+1)=n/(n+1)

Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).

因此,如果n+1-seen 结果集包含一个元素,如果它是n-seen 结果集的一部分并且没有被替换——这个概率是(s/n)*n/(n+1)=s/(n+1)——或者如果元素被选择——概率为s/(n+1)

The definition of the algorithm tells us that the first selements are automatically included as the first n=smembers of the result set. Therefore, the n-seenresult set includes each element with s/n(=1) probability giving us the necessary base case for the induction.

算法的定义告诉我们,第一个s元素会自动作为n=s结果集的第一个成员包含在内。因此,n-seen结果集包括具有s/n(=1) 概率的每个元素,为我们提供了归纳所需的基本情况。

References

参考

  1. McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184. (Link)

  2. Vitter, Jeffrey S. "Random sampling with a reservtheitroad." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57. (Link)

  1. McLeod、A. Ian 和 David R. Bellhouse。“一种用于绘制简单随机样本的便捷算法。” 皇家统计学会杂志。系列 C(应用统计)32.2(1983):182-184。(链接)

  2. Vitter, Jeffrey S.“水库随机采样。” ACM 数学软件交易 (TOMS) 11.1 (1985):37-57。(链接)

回答by Bill the Lizard

This is called a Reservtheitroad Samplingproblem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.

这称为水库采样问题。简单的解决方案是为列表的每个元素分配一个随机数,然后按照随机数的顺序保留顶部(或底部)k 个元素。

回答by Tom Hawtin - tackline

I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.

我建议:首先找到你的 k 个随机数。对它们进行排序。然后遍历链表和随机数一次。

If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r, generate a random number in [0, r), and if that is less than k, replace the rth item of the array. (Not entirely convinced that doesn't bias...)

如果您不知何故不知道链表的长度(如何?),那么您可以将第一个 k 抓取到一个数组中,然后对于节点 r,在 [0, r) 中生成一个随机数,如果小于比k,替换数组的第r个项目。(不完全相信没有偏见......)

Other than that: "If I were you, I wouldn't be starting from here." Are you sure linked list is right for your problem? Is there not a better data structure, such as a good old flat array list.

除此之外:“如果我是你,我就不会从这里开始。” 您确定链表适合您的问题吗?有没有更好的数据结构,比如一个很好的旧平面数组列表。

回答by mweerden

If you don't know the length of the list, then you will have to traverse it complete to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep kelements that form your random selection to that point. (Initially you just add the first kelements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e. the element you are at, at that moment).

如果您不知道列表的长度,则必须完整遍历它以确保随机选择。我在这种情况下使用的方法是 Tom Hawtin ( 54070)描述的方法。在遍历列表时,您将k形成随机选择的元素保留到该点。(最初,您只需添加k您遇到的第一个元素。)然后,以概率k/i,您将选择中的随机元素替换为i列表中的第 th 个元素(即您当时所在的元素)。

It's easy to show that this gives a random selection. After seeing melements (m > k), we have that each of the first melements of the list are part of you random selection with a probability k/m. That this initially holds is trivial. Then for each element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1)of being selected. We have that the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1)))(i.e. probability that element was in the list times the probability that it is still there). With calculus you can straightforwardly show that this is equal to k/(m+1).

很容易证明这给出了随机选择。在看到m元素 ( m > k) 之后,我们发现m列表的第一个元素都是随机选择的一部分,概率为k/m。这最初成立是微不足道的。然后对于每个元素m+1,您将其放入您的选择(替换随机元素)的概率k/(m+1)。您现在需要证明所有其他元素也有k/(m+1)被选中的概率。我们有概率是k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1)))(即元素在列表中的概率乘以它仍然存在的概率)。使用微积分,您可以直接证明这等于k/(m+1)

回答by Christian Oudard

Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count them. The simplest algorithm to do this is to just pick a random number in N and remove that item, repeated k times. Or, if it is permissible to return repeat numbers, don't remove the item.

好吧,您确实需要至少在运行时知道 N 是什么,即使这涉及对列表进行额外的传递以计算它们。最简单的算法是在 N 中选择一个随机数并删除该项目,重复 k 次。或者,如果允许返回重复编号,则不要删除该项目。

Unless you have a VERY large N, and very stringent performance requirements, this algorithm runs with O(N*k)complexity, which should be acceptable.

除非你有一个非常大的 N 和非常严格的性能要求,否则这个算法运行起来很O(N*k)复杂,这应该是可以接受的。

Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.

编辑:没关系,汤姆霍廷的方法要好得多。首先选择随机数,然后遍历列表一次。我认为理论上的复杂性相同,但预期运行时间要好得多。

回答by George Mauer

Why can't you just do something like

为什么你不能做这样的事情

List GetKRandomFromList(List input, int k)
  List ret = new List();
  for(i=0;i<k;i++)
    ret.Add(input[Math.Rand(0,input.Length)]);
  return ret;

I'm sure that you don't mean something that simple so can you specify further?

我确定您的意思不是那么简单,所以您可以进一步说明吗?