Java: best way to pick a random subset from a collection?
Notice: this page is a translated copy of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/136474/
best way to pick a random subset from a collection?
Asked by Tom
I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:
Vector itemsVector = getItems();
Collections.shuffle(itemsVector);
itemsVector.setSize(5);
While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is
Vector itemsVector = getItems();
Random rand = new Random(System.currentTimeMillis()); // would make this static to the class
List subsetList = new ArrayList(5);
for (int i = 0; i < 5; i++) {
    // be sure to use Vector.remove() or you may get the same item twice
    subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
}
Any suggestions on better ways to draw out a random subset from a Collection?
Accepted answer by Jonathan Leffler
Jon Bentley discusses this in either 'Programming Pearls' or 'More Programming Pearls'. You need to be careful with your N of M selection process, but I think the code shown works correctly. Rather than randomly shuffle all the items, you can do the random shuffle only shuffling the first N positions - which is a useful saving when N << M.
Knuth also discusses these algorithms - I believe that would be Vol 3 "Sorting and Searching", but my set is packed pending a move of house so I can't formally check that.
Answer by qualidafial
Your second solution of using Random to pick elements seems sound, however:
Depending on how sensitive your data is, I suggest using some sort of hashing method to scramble the random number seed. For a good case study, see How We Learned to Cheat at Online Poker (but this link is 404 as of 2015-12-18). Alternative URLs (found via a Google search on the article title in double quotes) include:
- How We Learned to Cheat at Online Poker— apparently the original publisher.
- How We Learned to Cheat at Online Poker
- How We Learned to Cheat at Online Poker
Vector is synchronized. If possible, use ArrayList instead to improve performance.
Answer by mmr
How much does remove cost? Because if that needs to rewrite the array to a new chunk of memory, then you've done O(5n) operations in the second version, rather than the O(n) you wanted before.
You could create an array of booleans set to false, and then:
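boolean[] boolArray = new boolean[itemsVector.size()]; // all false initially
for (int i = 0; i < 5; i++) {
    int r = rand.nextInt(itemsVector.size());
    // re-draw until we hit an index that has not been used yet
    while (boolArray[r]) {
        r = rand.nextInt(itemsVector.size());
    }
    subsetList.add(itemsVector.get(r));
    boolArray[r] = true;
}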
This approach works if your subset is smaller than your total size by a significant margin. As those sizes get close to one another (i.e., 1/4 the size or something), you'd get more collisions on that random number generator. In that case, I'd make a list of integers the size of your larger array, shuffle that list of integers, and pull off the first elements from it to get your (non-colliding) indices. That way, you have the cost of O(n) in building the integer array, and another O(n) in the shuffle, but no collisions from an internal while checker, and less than the potential O(5n) that remove may cost.
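The index-shuffling idea described above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name IndexShuffleSample and the method name pick are hypothetical and do not come from the answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class IndexShuffleSample {
    // Hypothetical helper: picks k distinct elements without modifying `items`.
    static <T> List<T> pick(List<T> items, int k, Random rand) {
        if (k < 0 || k > items.size()) {
            throw new IllegalArgumentException("k must be between 0 and items.size()");
        }
        // O(n): build the list of indices 0..n-1.
        List<Integer> indices = new ArrayList<>(items.size());
        for (int i = 0; i < items.size(); i++) {
            indices.add(i);
        }
        // O(n): shuffle the indices, not the items themselves.
        Collections.shuffle(indices, rand);
        // The first k shuffled indices are distinct by construction,
        // so no collision check is needed.
        List<T> subset = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            subset.add(items.get(indices.get(i)));
        }
        return subset;
    }
}
```

A call such as IndexShuffleSample.pick(itemsVector, 5, rand) would leave itemsVector untouched, at the price of an extra O(n) list of boxed integers.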
Answer by daniel
I'd personally opt for your initial implementation: very concise. Performance testing will show how well it scales. I've implemented a very similar block of code in a decently abused method and it scaled sufficiently. The particular code relied on arrays containing >10,000 items as well.
Answer by daniel
@Jonathan,
I believe this is the solution you're talking about:
void genknuth(int m, int n)
{   for (int i = 0; i < n; i++)
        /* select m of remaining n-i */
        if ((bigrand() % (n-i)) < m) {
            cout << i << "\n";
            m--;
        }
}
It's on page 127 of Programming Pearls by Jon Bentley and is based off of Knuth's implementation.
EDIT: I just saw a further modification on page 129:
void genshuf(int m, int n)
{   int i, j;
    int *x = new int[n];
    for (i = 0; i < n; i++)
        x[i] = i;
    for (i = 0; i < m; i++) {
        j = randint(i, n-1);
        int t = x[i]; x[i] = x[j]; x[j] = t;
    }
    sort(x, x+m);
    for (i = 0; i < m; i++)
        cout << x[i] << "\n";
}
This is based on the idea that "...we need shuffle only the first m elements of the array..."
Answer by Wesley Tarle
Set<Integer> s = new HashSet<Integer>();
// add random indexes to s; duplicates are ignored by the set
while (s.size() < 5)
{
    s.add(rand.nextInt(itemsVector.size()));
}
// iterate over s and put the items in the list
for (Integer i : s)
{
    out.add(itemsVector.get(i));
}
Answer by Greg Beech
I wrote an efficient implementation of this a few weeks back. It's in C# but the translation to Java is trivial (essentially the same code). The plus side is that it's also completely unbiased (which some of the existing answers aren't) - a way to test that is here.
It's based on a Durstenfeld implementation of the Fisher-Yates shuffle.
Answer by Dave L.
If you're trying to select k distinct elements from a list of n, the methods you gave above will be O(n) or O(kn), because removing an element from a Vector will cause an arraycopy to shift all the elements down.
Since you're asking for the best way, it depends on what you are allowed to do with your input list.
If it's acceptable to modify the input list, as in your examples, then you can simply swap k random elements to the beginning of the list and return them in O(k) time like this:
public static <T> List<T> getRandomSubList(List<T> input, int subsetSize)
{
    Random r = new Random();
    int inputSize = input.size();
    for (int i = 0; i < subsetSize; i++)
    {
        int indexToSwap = i + r.nextInt(inputSize - i);
        T temp = input.get(i);
        input.set(i, input.get(indexToSwap));
        input.set(indexToSwap, temp);
    }
    return input.subList(0, subsetSize);
}
If the list must end up in the same state it began, you can keep track of the positions you swapped, and then return the list to its original state after copying your selected sublist. This is still an O(k) solution.
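One way the "track the swaps and undo them" variant might look is sketched below. The class name RestoringSample and the method name pick are hypothetical, and the sketch assumes a random-access list such as ArrayList or Vector so that each swap is O(1).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class RestoringSample {
    // Hypothetical helper: selects k random elements by swapping them to the
    // front, copies them out, then undoes the swaps so `input` ends unchanged.
    static <T> List<T> pick(List<T> input, int k, Random r) {
        int n = input.size();
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be between 0 and input.size()");
        }
        int[] swappedWith = new int[k]; // position i was swapped with swappedWith[i]
        for (int i = 0; i < k; i++) {
            swappedWith[i] = i + r.nextInt(n - i);
            Collections.swap(input, i, swappedWith[i]);
        }
        // Copy the selection before restoring; subList alone is only a view.
        List<T> subset = new ArrayList<>(input.subList(0, k));
        // Undo the swaps in reverse order to restore the original arrangement.
        for (int i = k - 1; i >= 0; i--) {
            Collections.swap(input, i, swappedWith[i]);
        }
        return subset;
    }
}
```

The undo loop has to run in reverse because later swaps may have moved elements that earlier swaps placed.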
If, however, you cannot modify the input list at all and k is much less than n (like 5 from 100), it would be much better not to remove selected elements each time, but simply select each element, and if you ever get a duplicate, toss it out and reselect. This will give you O(kn / (n-k)) which is still close to O(k) when n dominates k. (For example, if k is less than n / 2, then it reduces to O(k)).
If k is not dominated by n, and you cannot modify the list, you might as well copy your original list and use your first solution, because O(n) will be just as good as O(k).
As others have noted, if you are depending on strong randomness where every sublist is possible (and unbiased), you'll definitely need something stronger than java.util.Random. See java.security.SecureRandom.
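Since java.security.SecureRandom extends java.util.Random, it can be passed anywhere a Random is expected. A minimal sketch, applied to the simple copy-and-shuffle approach (the class name SecureSample and the method name pick are hypothetical):

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SecureSample {
    // Hypothetical helper: shuffles a copy of `items` using SecureRandom
    // as the randomness source and returns the first k elements.
    static <T> List<T> pick(List<T> items, int k) {
        if (k < 0 || k > items.size()) {
            throw new IllegalArgumentException("k must be between 0 and items.size()");
        }
        List<T> copy = new ArrayList<>(items);          // leave the input untouched
        Collections.shuffle(copy, new SecureRandom());  // SecureRandom is a Random
        return new ArrayList<>(copy.subList(0, k));
    }
}
```

SecureRandom is typically slower than Random, so this is probably only worth it when predictability of the selection actually matters.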
Answer by Tyler
This is a very similar question on stackoverflow.
To summarize my favorite answers from that page (first one from user Kyle):
- O(n) solution: Iterate through your list, and copy out an element (or reference thereto) with probability (#needed / #remaining). Example: if k = 5 and n = 100, then you take the first element with prob 5/100. If you copy that one, then you choose the next with prob 4/99; but if you didn't take the first one, the prob is 5/99.
- O(k log k) or O(k^2): Build a sorted list of k indices (numbers in {0, 1, ..., n-1}) by randomly choosing a number < n, then randomly choosing a number < n-1, etc. At each step, you need to recalibrate your choice to avoid collisions and keep the probabilities even. As an example, if k=5 and n=100, and your first choice is 43, your next choice is in the range [0, 98], and if it's >=43, then you add 1 to it. So if your second choice is 50, then you add 1 to it, and you have {43, 51}. If your next choice is 51, you add 2 to it to get {43, 51, 53}.
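The O(n) approach from the first bullet (take each element with probability #needed / #remaining) could be sketched in Java as below. The class name SelectionSample and the method name pick are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SelectionSample {
    // Hypothetical helper: one pass over `items`, taking each element with
    // probability (needed / remaining). The input is not modified and the
    // result keeps the elements in their original order.
    static <T> List<T> pick(List<T> items, int k, Random rand) {
        int n = items.size();
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be between 0 and items.size()");
        }
        List<T> subset = new ArrayList<>(k);
        int needed = k;
        for (int i = 0; i < n && needed > 0; i++) {
            int remaining = n - i; // elements not yet considered, including this one
            // nextInt(remaining) is uniform on [0, remaining), so this test
            // succeeds with probability needed / remaining.
            if (rand.nextInt(remaining) < needed) {
                subset.add(items.get(i));
                needed--;
            }
        }
        return subset;
    }
}
```

When needed equals remaining the test always succeeds, so the loop always returns exactly k elements.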
Here is some Python -
import random

# Returns a sorted list s with k distinct random numbers from {0, 1, ..., n-1}
def ChooseRandomSubset(n, k):
    s = []
    for i in range(k):
        r = random.randrange(n - i)  # May be 0, must be < n-i
        # This is the search: first index q with s[q] - q > r, or len(s) if none.
        q = next((j for j, v in enumerate(s) if v - j > r), len(s))
        s.insert(q, r + q)  # Inserts right before position q, keeping s sorted.
    return s
I'm saying that the time complexity is O(k^2) or O(k log k) because it depends on how quickly you can search and insert into your container for s. If s is a normal list, one of those operations is linear, and you get k^2. However, if you're willing to build s as a balanced binary tree, you can get out the O(k log k) time.
Answer by user967710
Here are two solutions that I don't think appear here. The correspondence is quite long and contains some links; however, I don't think all of the posts relate to the problem of choosing a subset of K elements out of a set of N elements. [By "set", I refer to the mathematical term, i.e. all elements appear once and order is not important.]
Sol 1:
// Assume the set is given as an array:
Object[] set ....;
for (int i = 0; i < K; i++) {
    randomNumber = random() % N;
    print set[randomNumber];
    // swap the chosen element with the last place
    temp = set[randomNumber];
    set[randomNumber] = set[N-1];
    set[N-1] = temp;
    // decrease N
    N--;
}
This looks similar to the answer daniel gave, but it actually is very different. It is of O(k) run time.
Another solution is to use some math: consider the array indexes as Z_n, so we can randomly choose 2 numbers: x, which is co-prime to n (i.e. choose x with gcd(x,n)=1), and another, a, which is the "starting point". Then the series a % n, (a+x) % n, (a+2*x) % n, ..., (a+(k-1)*x) % n is a sequence of distinct numbers (as long as k<=n).
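A rough Java sketch of this arithmetic-progression idea follows. The class name CoprimeStrideSample and the method names are hypothetical. Note that the indices are guaranteed distinct, but since the whole sequence is determined by just a and x, not every k-subset is equally likely, so this is probably unsuitable where an unbiased selection is required.

```java
import java.util.Random;

final class CoprimeStrideSample {
    // Euclid's algorithm for the greatest common divisor.
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Hypothetical helper: returns k distinct indices in [0, n) of the form
    // (a + i*x) mod n, where x is co-prime to n and a is a random start.
    static int[] pickIndices(int n, int k, Random rand) {
        if (n <= 0 || k < 0 || k > n) {
            throw new IllegalArgumentException("need n > 0 and 0 <= k <= n");
        }
        int a = rand.nextInt(n); // starting point
        int x;
        do {
            x = 1 + rand.nextInt(n); // candidate stride in [1, n]
        } while (gcd(x, n) != 1);    // retry until x is co-prime to n
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // long arithmetic avoids int overflow in i * x for large n
            result[i] = (int) ((a + (long) i * x) % n);
        }
        return result;
    }
}
```

The loop that searches for x always terminates, because x = 1 is co-prime to every n.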