C++ 为什么人们说使用随机数生成器时存在模偏差?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/10984974/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why do people say there is modulo bias when using a random number generator?
提问by user1413793
I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand()
in C++.
我看到这个问题问了很多,但从未见过真正具体的答案。所以我将在这里发布一个,希望能帮助人们理解为什么在使用随机数生成器(例如rand()
在 C++ 中)时究竟存在“模偏差” 。
回答by user1413793
So rand()
is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX
, which is a constant defined in cstdlib
(see this articlefor a general overview on rand()
).
所以rand()
是选择0之间的自然数和伪随机数发生器RAND_MAX
,它是在定义的常数cstdlib
(见本文章有关的一般概述rand()
)。
Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say RAND_MAX
is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3
. However, rand()%3
does not produce the numbers between 0 and 2 with equal probability!
现在如果你想生成一个介于 0 和 2 之间的随机数会发生什么?为了解释起见,假设RAND_MAX
是 10,我决定通过调用rand()%3
. 但是,rand()%3
不会以相等的概率产生 0 到 2 之间的数字!
When rand()
returns 0, 3, 6, or 9,rand()%3 == 0
. Therefore, P(0) = 4/11
当rand()
返回 0、3、6 或 9 时,rand()%3 == 0
. 因此,P(0) = 4/11
When rand()
returns 1, 4, 7, or 10,rand()%3 == 1
. Therefore, P(1) = 4/11
当rand()
返回 1、4、7 或 10 时,rand()%3 == 1
. 因此,P(1) = 4/11
When rand()
returns 2, 5, or 8,rand()%3 == 2
. Therefore, P(2) = 3/11
当rand()
返回 2、5 或 8 时,rand()%3 == 2
. 因此,P(2) = 3/11
This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
这不会以相等的概率生成 0 到 2 之间的数字。当然,对于小范围,这可能不是最大的问题,但对于较大的范围,这可能会扭曲分布,偏向较小的数字。
So when does rand()%n
return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1
. In this case, along with our earlier assumption rand()
does return a number between 0 and RAND_MAX
with equal probability, the modulo classes of n would also be equally distributed.
那么什么时候rand()%n
以相等的概率返回从 0 到 n-1 的数字范围呢?当RAND_MAX%n == n - 1
. 在这种情况下,除了我们之前的假设rand()
确实返回一个介于 0 和RAND_MAX
相等概率之间的数字之外,n 的模类也将均匀分布。
So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
那么我们如何解决这个问题呢?一种粗略的方法是不断生成随机数,直到获得所需范围内的数字:
int x;
do {
x = rand();
} while (x >= n);
but that's inefficient for low values of n
, since you only have a n/RAND_MAX
chance of getting a value in your range, and so you'll need to perform RAND_MAX/n
calls to rand()
on average.
但这对于 的低值来说效率低下n
,因为您只有n/RAND_MAX
机会获得您范围内的值,因此您需要平均RAND_MAX/n
调用rand()
。
A more efficient formula approach would be to take some large range with a length divisible by n
, like RAND_MAX - RAND_MAX % n
, keep generating random numbers until you get one that lies in the range, and then take the modulus:
一种更有效的公式方法是取一些长度可被 整除的大范围n
,例如RAND_MAX - RAND_MAX % n
,继续生成随机数,直到得到一个位于该范围内的随机数,然后取模数:
int x;
do {
x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
For small values of n
, this will rarely require more than one call to rand()
.
对于较小的 值n
,这很少需要多次调用rand()
。
Works cited and further reading:
引用的作品和进一步阅读:
回答by Nick Dandoulakis
Keep selecting a random is a good way to remove the bias.
保持随机选择是消除偏差的好方法。
Update
更新
We could make the code fast if we search for an x in range divisible by n
.
如果我们在范围内搜索可被 整除的 x,我们可以使代码快速n
。
// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]
int x;
// Keep searching for an x in a range divisible by n
do {
x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n))
x %= n;
The above loop should be very fast, say 1 iteration on average.
上面的循环应该非常快,比如平均 1 次迭代。
回答by Rob Napier
@user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of n
and large values of RAND_MAX
, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as arc4random_uniform
), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.
@user1413793 关于这个问题是正确的。我不打算进一步讨论这一点,只是要说明一点:是的,对于 的小值n
和大值RAND_MAX
,模偏差可能非常小。但是使用偏差诱导模式意味着每次计算随机数时都必须考虑偏差,并为不同的情况选择不同的模式。如果你做出错误的选择,它引入的错误是微妙的,几乎不可能进行单元测试。与仅使用适当的工具(例如arc4random_uniform
)相比,这是额外的工作,而不是更少的工作。做更多的工作并得到更糟糕的解决方案是糟糕的工程,尤其是在大多数平台上每次都做对时很容易。
Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.
不幸的是,该解决方案的实现都是不正确的或效率低于应有的水平。(每个解决方案都有解释问题的各种注释,但没有一个解决方案已被修复以解决这些问题。)这可能会使临时寻求答案的人感到困惑,因此我在这里提供了一个已知良好的实现。
Again, the best solution is just to use arc4random_uniform
on platforms that provide it, or a similar ranged solution for your platform (such as Random.nextInt
on Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.
同样,最好的解决方案是arc4random_uniform
在提供它的平台上使用,或者在您的平台上使用类似的范围解决方案(例如Random.nextInt
在 Java 上)。它将为您做正确的事情,而无需您付出任何代码成本。这几乎总是正确的呼吁。
If you don't have arc4random_uniform
, then you can use the power of opensource to see exactly how it is implemented on top of a wider-range RNG (ar4random
in this case, but a similar approach could also work on top of other RNGs).
如果您没有arc4random_uniform
,那么您可以使用开源的强大功能来准确了解它是如何在更广泛的 RNG 之上实现的(ar4random
在这种情况下,但类似的方法也可以在其他 RNG 之上工作)。
Here is the OpenBSD implementation:
这是OpenBSD 的实现:
/*
* Calculate a uniformly distributed random number less than upper_bound
* avoiding "modulo bias".
*
* Uniformity is achieved by generating new random numbers until the one
* returned is outside the range [0, 2**32 % upper_bound). This
* guarantees the selected random number will be inside
* [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
* after reduction modulo upper_bound.
*/
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
u_int32_t r, min;
if (upper_bound < 2)
return 0;
/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;
/*
* This could theoretically loop forever but each retry has
* p > 0.5 (worst case, usually far better) of selecting a
* number inside the range we need, so it should rarely need
* to re-roll.
*/
for (;;) {
r = arc4random();
if (r >= min)
break;
}
return r % upper_bound;
}
It is worth noting the latest commit comment on this code for those who need to implement similar things:
值得注意的是,对于那些需要实现类似事情的人,该代码的最新提交评论是:
Change arc4random_uniform() to calculate
2**32 % upper_bound
as-upper_bound % upper_bound
. Simplifies the code and makes it the same on both ILP32 and LP64 architectures, and also slightly faster on LP64 architectures by using a 32-bit remainder instead of a 64-bit remainder.Pointed out by Jorden Verwer on tech@ ok deraadt; no objections from djm or otto
将 arc4random_uniform() 更改为计算
2**32 % upper_bound
为-upper_bound % upper_bound
。简化代码并使其在 ILP32 和 LP64 架构上相同,并且在 LP64 架构上通过使用 32 位余数而不是 64 位余数稍微快一点。Jorden Verwer 在 tech@ok deraadt 上指出;djm 或 otto 没有反对意见
The Java implementation is also easily findable (see previous link):
Java 实现也很容易找到(请参阅上一个链接):
public int nextInt(int n) {
if (n <= 0)
throw new IllegalArgumentException("n must be positive");
if ((n & -n) == n) // i.e., n is a power of 2
return (int)((n * (long)next(31)) >> 31);
int bits, val;
do {
bits = next(31);
val = bits % n;
} while (bits - val + (n-1) < 0);
return val;
}
回答by Jim Wood
Definition
定义
Modulo Biasis the inherent bias in using modulo arithmetic to reduce an output set to a subset of the input set. In general, a bias exists whenever the mapping between the input and output set is not equally distributed, as in the case of using modulo arithmetic when the size of the output set is not a divisor of the size of the input set.
模偏差是使用模运算将输出集减少到输入集的子集时的固有偏差。通常,只要输入和输出集之间的映射分布不均,就会存在偏差,例如在输出集的大小不是输入集大小的除数时使用模运算的情况。
This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding truly random sources of randomness is also extremely difficult, but is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.
这种偏差在计算中特别难以避免,其中数字表示为位串:0 和 1。寻找真正随机的随机源也极其困难,但这超出了本讨论的范围。对于本答案的其余部分,假设存在无限的真正随机位来源。
Problem Example
问题示例
Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent the number 6, which is 3 bits. Unfortunately, 3 random bits yields 8 possible outcomes:
让我们考虑使用这些随机位模拟掷骰子(0 到 5)。有 6 种可能性,因此我们需要足够的位来表示数字 6,即 3 位。不幸的是,3 个随机位会产生 8 种可能的结果:
000 = 0, 001 = 1, 010 = 2, 011 = 3
100 = 4, 101 = 5, 110 = 6, 111 = 7
We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6, however this presents the modulo biasproblem: 110
yields a 0, and 111
yields a 1. This die is loaded.
我们可以通过取模 6 的值来将结果集的大小精确地减小到 6,但这会带来模偏差问题:110
产生 0,111
产生 1。这个骰子被加载。
Potential Solutions
潜在的解决方案
Approach 0:
方法0:
Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).
与其依赖随机位,理论上可以雇佣一支小部队整天掷骰子并将结果记录在数据库中,然后每个结果只使用一次。这听起来很实用,而且很可能不会产生真正随机的结果(双关语)。
Approach 1:
方法一:
Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield 110
and 111
and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rollsthemselves. This is clearly impractical for all but the most trivial of uses.
除了使用模量,天真但数学正确的办法是丢弃结果产量110
和111
和简单的3个新位再试一次。不幸的是,这意味着每次掷骰有25% 的机会需要重新掷骰,包括每次重新掷骰本身。除了最微不足道的用途外,这显然不切实际。
Approach 2:
方法二:
Use more bits: instead of 3 bits, use 4. This yield 16 possible outcomes. Of course, re-rolling anytime the result is greater than 5 makes things worse (10/16 = 62.5%) so that alone won't help.
使用更多位:而不是 3 位,使用 4。这会产生 16 种可能的结果。当然,在结果大于 5 的任何时候重新滚动会使情况变得更糟 (10/16 = 62.5%),因此仅凭这一点是无济于事的。
Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.
请注意,2 * 6 = 12 < 16,因此我们可以安全地取任何小于 12 的结果并减少该模 6 以均匀分布结果。其他 4 个结果必须丢弃,然后像之前的方法一样重新滚动。
Sounds good at first, but let's check the math:
一开始听起来不错,但让我们检查一下数学:
4 discarded results / 16 possibilities = 25%
In this case, 1 extra bit didn't helpat all!
在这种情况下,额外的 1 位根本没有帮助!
That result is unfortunate, but let's try again with 5 bits:
这个结果很不幸,但让我们用 5 位再试一次:
32 % 6 = 2 discarded results; and
2 discarded results / 32 possibilities = 6.25%
A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.
一个明确的改进,但在许多实际情况下还不够好。好消息是,添加更多位永远不会增加需要丢弃和重新滚动的机会。这不仅适用于骰子,而且适用于所有情况。
As demonstrated however, adding an 1 extra bit may not change anything.In fact if we increase our roll to 6 bits, the probability remains 6.25%.
然而,正如所演示的,添加 1 个额外位可能不会改变任何东西。事实上,如果我们将滚动增加到 6 位,概率仍然是 6.25%。
This begs 2 additional questions:
这引出了两个额外的问题:
- If we add enough bits, is there a guarantee that the probability of a discard will diminish?
- How many bits are enoughin the general case?
- 如果我们添加足够多的位,是否可以保证丢弃的概率会减少?
- 一般情况下多少位就足够了?
General Solution
通用解决方案
Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4 which coincidentally are a multiple of 2 from each other, so that for an even x > 1,
谢天谢地,第一个问题的答案是肯定的。6 的问题是 2^x mod 6 在 2 和 4 之间翻转,巧合的是彼此是 2 的倍数,因此对于偶数 x > 1,
[2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
Thus 6 is an exception rather than the rule. It is possible to find larger moduli that yield consecutive powers of 2 in the same way, but eventually this must wrap around, and the probability of a discard will be reduced.
因此,6 是一个例外而不是规则。可以找到更大的模,以相同的方式产生连续的 2 次幂,但最终这必须环绕,并且丢弃的概率将降低。
Without offering further proof, in general using double the number of bits requiredwill provide a smaller, usually insignificant, chance of a discard.
在不提供进一步证明的情况下,通常使用所需位数的两倍将提供较小的、通常微不足道的丢弃机会。
Proof of Concept
概念证明
Here is an example program that uses OpenSSL's libcrypo to supply random bytes. When compiling, be sure to link to the library with -lcrypto
which most everyone should have available.
这是一个使用 OpenSSL 的 libcrypo 提供随机字节的示例程序。编译时,一定要链接到-lcrypto
大多数人都应该使用的库。
#include <iostream>
#include <assert.h>
#include <limits>
#include <openssl/rand.h>
volatile uint32_t dummy;
uint64_t discardCount;
uint32_t uniformRandomUint32(uint32_t upperBound)
{
assert(RAND_status() == 1);
uint64_t discard = (std::numeric_limits<uint64_t>::max() - upperBound) % upperBound;
uint64_t randomPool = RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
++discardCount;
}
return randomPool % upperBound;
}
int main() {
discardCount = 0;
const uint32_t MODULUS = (1ul << 31)-1;
const uint32_t ROLLS = 10000000;
for(uint32_t i = 0; i < ROLLS; ++i) {
dummy = uniformRandomUint32(MODULUS);
}
std::cout << "Discard count = " << discardCount << std::endl;
}
I encourage playing with the MODULUS
and ROLLS
values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to file and verify the distribution appears normal.
我鼓励使用MODULUS
和ROLLS
值来查看在大多数情况下实际发生了多少次重投。持怀疑态度的人也可能希望将计算值保存到文件中并验证分布是否正常。
回答by AProgrammer
There are two usual complaints with the use of modulo.
模数的使用有两个常见的抱怨。
one is valid for all generators. It is easier to see in a limit case. If your generator has a RAND_MAX which is 2 (that isn't compliant with the C standard) and you want only 0 or 1 as value, using modulo will generate 0 twice as often (when the generator generates 0 and 2) as it will generate 1 (when the generator generates 1). Note that this is true as soon as you don't drop values, whatever the mapping you are using from the generator values to the wanted one, one will occurs twice as often as the other.
some kind of generator have their less significant bits less random than the other, at least for some of their parameters, but sadly those parameter have other interesting characteristic (such has being able to have RAND_MAX one less than a power of 2). The problem is well known and for a long time library implementation probably avoid the problem (for instance the sample rand() implementation in the C standard use this kind of generator, but drop the 16 less significant bits), but some like to complain about that and you may have bad luck
一个对所有生成器都有效。在极限情况下更容易看出。如果您的生成器的 RAND_MAX 为 2(不符合 C 标准),并且您只需要 0 或 1 作为值,则使用模将生成 0 的频率(当生成器生成 0 和 2 时)的两倍生成 1(当生成器生成 1 时)。请注意,只要您不删除值,这是正确的,无论您使用的是从生成器值到所需值的映射,一个发生的频率是另一个的两倍。
某种生成器的低重要位比另一种更不随机,至少对于它们的某些参数而言,但遗憾的是,这些参数具有其他有趣的特性(例如能够使 RAND_MAX 小于 2 的幂)。这个问题是众所周知的,很长一段时间的库实现可能会避免这个问题(例如,C 标准中的示例 rand() 实现使用这种生成器,但删除了 16 个不太重要的位),但有些人喜欢抱怨那你可能运气不好
Using something like
使用类似的东西
int alea(int n){
assert (0 < n && n <= RAND_MAX);
int partSize =
n == RAND_MAX ? 1 : 1 + (RAND_MAX-n)/(n+1);
int maxUsefull = partSize * n + (partSize-1);
int draw;
do {
draw = rand();
} while (draw > maxUsefull);
return draw/partSize;
}
to generate a random number between 0 and n will avoid both problems (and it avoids overflow with RAND_MAX == INT_MAX)
生成 0 和 n 之间的随机数将避免这两个问题(并且它避免了 RAND_MAX == INT_MAX 溢出)
BTW, C++11 introduced standard ways to the the reduction and other generator than rand().
顺便说一句,C++11 引入了标准的归约方法和除 rand() 之外的其他生成器。
回答by Ben Personick
Mark's Solution (The accepted solution) is Nearly Perfect.
马克的解决方案(公认的解决方案)是近乎完美的。
int x; do { x = rand(); } while (x >= (RAND_MAX - RAND_MAX % n)); x %= n;
edited Mar 25 '16 at 23:16
Mark Amery 39k21170211
int x; do { x = rand(); } while (x >= (RAND_MAX - RAND_MAX % n)); x %= n;
16 年 3 月 25 日 23:16 编辑
马克·艾默里 39k21170211
However, it has a caveat which discards 1 valid set of outcomes in any scenario where RAND_MAX
(RM
) is 1 less than a multiple of N
(Where N
= the Number of possible valid outcomes).
但是,它有一个警告,即在RAND_MAX
( RM
) 小于 1 的倍数N
(其中N
= 可能的有效结果的数量)的任何情况下,都会丢弃 1 个有效的结果集。
ie, When the 'count of values discarded' (D
) is equal to N
, then they are actually a valid set (V)
, not an invalid set (I
).
即,当“丢弃的值的计数” ( D
) 等于 时N
,它们实际上是一个有效的集合 ( V)
,而不是无效的集合 ( I
)。
What causes this is at some point Mark loses sight of the difference between N
and Rand_Max
.
是什么原因导致这在某些时候马克失去视力之间的差异N
和Rand_Max
。
N
is a set who's valid members are comprised only of Positive Integers, as it contains a count of responses that would be valid. (eg: Set N
= {1, 2, 3, ... n }
)
N
是有效成员仅由正整数组成的集合,因为它包含有效响应的计数。(例如:设置N
= {1, 2, 3, ... n }
)
Rand_max
However is a set which ( as defined for our purposes ) includes any number of non-negative integers.
Rand_max
然而,一个集合(根据我们的目的定义)包括任意数量的非负整数。
In it's most generic form, what is defined here as Rand Max
is the Set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.
在最通用的形式中,这里定义的Rand Max
是所有有效结果的集合,理论上可以包括负数或非数字值。
Therefore Rand_Max
is better defined as the set of "Possible Responses".
因此Rand_Max
,最好将其定义为“可能的响应”集。
However N
operates against the count of the values within the set of valid responses, so even as defined in our specific case, Rand_Max
will be a value one less than the total number it contains.
然而,N
针对有效响应集中的值的计数进行操作,因此即使在我们的特定情况下定义,Rand_Max
值也将比它包含的总数小 1。
Using Mark's Solution, Values are Discarded when: X => RM - RM % N
使用 Mark 的解决方案,在以下情况下丢弃值:X => RM - RM % N
EG:
Ran Max Value (RM) = 255
Valid Outcome (N) = 4
When X => 252, Discarded values for X are: 252, 253, 254, 255
So, if Random Value Selected (X) = {252, 253, 254, 255}
Number of discarded Values (I) = RM % N + 1 == N
IE:
I = RM % N + 1
I = 255 % 4 + 1
I = 3 + 1
I = 4
X => ( RM - RM % N )
255 => (255 - 255 % 4)
255 => (255 - 3)
255 => (252)
Discard Returns $True
As you can see in the example above, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255 we would discard it even though these four values comprise a valid set of returned values.
正如你在上面的例子中看到的,当 X 的值(我们从初始函数中得到的随机数)是 252、253、254 或 255 时,我们会丢弃它,即使这四个值组成了一组有效的返回值.
IE: When the count of the values Discarded (I) = N (The number of valid outcomes) then a Valid set of return values will be discarded by the original function.
IE:当丢弃的值的计数(I)= N(有效结果的数量)时,原始函数将丢弃一组有效的返回值。
If we describe the difference between the values N and RM as D, ie:
如果我们将 N 和 RM 之间的差值描述为 D,即:
D = (RM - N)
Then as the value of D becomes smaller, the Percentage of unneeded re-rolls due to this method increases at each natural multiplicative. (When RAND_MAX is NOT equal to a Prime Number this is of valid concern)
然后,随着 D 的值变小,由于这种方法而导致的不需要重新滚动的百分比在每个自然乘法时都会增加。(当 RAND_MAX 不等于质数时,这是值得关注的)
EG:
例如:
RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%
RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N= 128 Then D = 127, Lost percentage = 50%
Since the percentage of Rerolls needed increases the closer N comes to RM, this can be of valid concern at many different values depending on the constraints of the system running he code and the values being looked for.
由于 N 越接近 RM,所需的 Rerolls 百分比就会增加,因此根据运行代码的系统的约束和正在查找的值,这对于许多不同的值可能是值得关注的。
To negate this we can make a simple amendment As shown here:
为了否定这一点,我们可以做一个简单的修改,如下所示:
int x;
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) );
x %= n;
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.
这提供了一个更通用的公式版本,它解释了使用模数定义最大值的额外特性。
Examples of using a small value for RAND_MAX which is a multiplicative of N.
使用 RAND_MAX 的小值的示例,它是 N 的乘法。
Mark'original Version:
马克原版:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
Generalized Version 1:
通用版本 1:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n ) + 1 ) % n )
When X > 3 the value would be discarded, but this is not a vlue in the set RAND_MAX so there will be no discard.
Additionally, in the case where N should be the number of values in RAND_MAX; in this case, you could set N = RAND_MAX +1, unless RAND_MAX = INT_MAX.
此外,在 N 应该是 RAND_MAX 中值的数量的情况下;在这种情况下,您可以设置 N = RAND_MAX +1,除非 RAND_MAX = INT_MAX。
Loop-wise you could just use N = 1, and any value of X will be accepted, however, and put an IF statement in for your final multiplier. But perhaps you have code that may have a valid reason to return a 1 when the function is called with n = 1...
循环方面,您可以只使用 N = 1,并且 X 的任何值都将被接受,但是,并为您的最终乘数放入 IF 语句。但是,当使用 n = 1 调用函数时,您的代码可能有正当理由返回 1 ...
So it may be better to use 0, which would normally provide a Div 0 Error, when you wish to have n = RAND_MAX+1
因此,当您希望 n = RAND_MAX+1 时,最好使用 0,这通常会提供 Div 0 错误
Generalized Version 2:
通用版本 2:
int x;
if n != 0 {
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) );
x %= n;
} else {
x = rand();
}
Both of these solutions resolve the issue with needlessly discarded valid results which will occur when RM+1 is a product of n.
这两种解决方案都解决了当 RM+1 是 n 的乘积时会发生的不必要丢弃有效结果的问题。
The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.
当您需要 n 等于 RAND_MAX 中包含的所有可能值集时,第二个版本还涵盖了边缘情况。
The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.
两者中修改后的方法是相同的,并且允许更通用的解决方案来满足提供有效随机数和最小化丢弃值的需要。
To reiterate:
重申:
The Basic General Solution which extends mark's example:
扩展标记示例的基本通用解决方案:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x;
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
允许 RAND_MAX+1 = n 的一个额外场景的扩展通用解决方案:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x;
if n != 0 {
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
} else {
x = rand();
}
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
在某些语言(特别是解释型语言)中,在 while 条件之外进行比较操作的计算可能会导致更快的结果,因为无论需要多少次重试,这都是一次性计算。天啊!
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x; // Resulting random number
int y; // One-time calculation of the compare value for x
if n != 0 {
y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n)
do {
x = rand();
} while (x > y);
x %= n;
} else {
x = rand();
}
回答by Rivenfall
With a RAND_MAX
value of 3
(in reality it should be much higher than that but the bias would still exist) it makes sense from these calculations that there is a bias:
随着RAND_MAX
价值3
(实际上它应该是高于很多,但偏置仍存在),它是有道理的,从这些计算是有偏差:
1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1
1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1
In this case, the % 2
is what you shouldn't do when you want a random number between 0
and 1
. You could get a random number between 0
and 2
by doing % 3
though, because in this case: RAND_MAX
is a multiple of 3
.
在这种情况下,% 2
当您想要0
和之间的随机数时,您不应该这样做1
。通过这样做,您可以在0
和之间获得一个随机数,因为在这种情况下:是 的倍数。2
% 3
RAND_MAX
3
Another method
另一种方法
There is much simpler but to add to other answers, here is my solution to get a random number between 0
and n - 1
, so n
different possibilities, without bias.
有更简单的方法,但要添加到其他答案中,这是我的解决方案,可在0
和之间获得随机数n - 1
,因此有n
不同的可能性,没有偏见。
- the number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need
- encode the number from random bits
- if this number is
>= n
, restart (no modulo).
- 编码可能性数量所需的位数(而不是字节)是您需要的随机数据的位数
- 从随机位编码数字
- 如果此数字为
>= n
,则重新启动(无模数)。
Really random data is not easy to obtain, so why use more bits than needed.
真正的随机数据不容易获得,所以为什么要使用比需要更多的位。
Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert so use at your own risk.
下面是 Smalltalk 中的一个示例,使用来自伪随机数生成器的位缓存。我不是安全专家,因此使用风险自负。
next: n
| bitSize r from to |
n < 0 ifTrue: [^0 - (self next: 0 - n)].
n = 0 ifTrue: [^nil].
n = 1 ifTrue: [^0].
cache isNil ifTrue: [cache := OrderedCollection new].
cache size < (self randmax highBit) ifTrue: [
Security.DSSRandom default next asByteArray do: [ :byte |
(1 to: 8) do: [ :i | cache add: (byte bitAt: i)]
]
].
r := 0.
bitSize := n highBit.
to := cache size.
from := to - bitSize + 1.
(from to: to) do: [ :i |
r := r bitAt: i - from + 1 put: (cache at: i)
].
cache removeFrom: from to: to.
r >= n ifTrue: [^self next: n].
^r
回答by bobobobo
As the accepted answerindicates, "modulo bias" has its roots in the low value of RAND_MAX
. He uses an extremely small value of RAND_MAX
(10) to show that if RAND_MAX were 10, then you tried to generate a number between 0 and 2 using %, the following outcomes would result:
正如公认的答案所表明的那样,“模偏差”的根源在于 的低值RAND_MAX
。他使用极小的RAND_MAX
(10)值来表明,如果 RAND_MAX 为 10,那么您尝试使用 % 生成 0 到 2 之间的数字,将导致以下结果:
rand() % 3 // if RAND_MAX were only 10, gives
output of rand() | rand()%3
0 | 0
1 | 1
2 | 2
3 | 0
4 | 1
5 | 2
6 | 0
7 | 1
8 | 2
9 | 0
So there are 4 outputs of 0's (4/10 chance) and only 3 outputs of 1 and 2 (3/10 chances each).
所以有 4 个 0 输出(4/10 机会),只有 3 个 1 和 2 输出(每个 3/10 机会)。
So it's biased. The lower numbers have a better chance of coming out.
所以是有偏见的。较低的数字有更好的机会出来。
But that only shows up so obviously when RAND_MAX
is small. Or more specifically, when the number your are modding by is large compared to RAND_MAX
.
但那只有在RAND_MAX
很小的时候才会表现得如此明显。或者更具体地说,当您正在修改的数字与RAND_MAX
.
A much better solution than looping(which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twisteralgorithm has a maximum output of 4,294,967,295. As such doing MersenneTwister::genrand_int32() % 10
for all intents and purposes, will be equally distributed and the modulo bias effect will all but disappear.
比循环(效率极低,甚至不应该被建议)更好的解决方案是使用具有更大输出范围的 PRNG。在梅森倍捻机算法的4,294,967,295最大输出。因此,MersenneTwister::genrand_int32() % 10
出于所有意图和目的这样做,将被平均分配,模偏差效应将几乎消失。
回答by Yavuz Koroglu
I just wrote a code for Von Neumann's Unbiased Coin Flip Method, that should theoretically eliminate any bias in the random number generation process. More info can be found at (http://en.wikipedia.org/wiki/Fair_coin)
我刚刚为 Von Neumann 的 Unbiased Coin Flip Method 编写了一个代码,理论上应该可以消除随机数生成过程中的任何偏差。更多信息可以在(http://en.wikipedia.org/wiki/Fair_coin)找到
int unbiased_random_bit() {
int x1, x2, prev;
prev = 2;
x1 = rand() % 2;
x2 = rand() % 2;
for (;; x1 = rand() % 2, x2 = rand() % 2)
{
if (x1 ^ x2) // 01 -> 1, or 10 -> 0.
{
return x2;
}
else if (x1 & x2)
{
if (!prev) // 0011
return 1;
else
prev = 1; // 1111 -> continue, bias unresolved
}
else
{
if (prev == 1)// 1100
return 0;
else // 0000 -> continue, bias unresolved
prev = 0;
}
}
}