What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1930454/

Time: 2020-08-13 01:32:30  Source: igfitidea

java algorithm statistics

Asked by Simon

I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know any neat little tricks for calculating an average that doesn't require also calculating the sum?

I am using Java 1.5.

Accepted answer by martinus

You can calculate the mean iteratively. This algorithm is simple and fast: you process each value just once, and the variables never get larger than the largest value in the set, so you won't get an overflow.

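// Running mean: after t values have been processed, avg is the mean of those t values.
double mean(double[] ary) {
  double avg = 0;
  int t = 1;  // 1-based position of the value being folded in
  for (double x : ary) {
    avg += (x - avg) / t;  // move avg toward x by 1/t of the gap
    ++t;
  }
  return avg;
}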

Inside the loop, avg is always the average of all values processed so far. In other words, if all the values are finite you should not get an overflow.
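To make the difference concrete, here is a minimal sketch that puts a naive sum-then-divide next to the iterative update. The class and method names are made up for illustration. One caveat worth stating: the intermediate `x - avg` can itself overflow if the data mixes very large positive and very large negative values, so the no-overflow claim is safest for same-sign data.

```java
class RunningMeanDemo {
    // Naive approach: accumulate the full sum, then divide.
    // The sum can overflow to Infinity even when the mean is representable.
    static double naiveMean(double[] values) {
        double sum = 0;
        for (double x : values) {
            sum += x;
        }
        return sum / values.length;
    }

    // Iterative update: avg moves toward each new value by 1/t of the gap,
    // so no running sum is ever stored.
    static double runningMean(double[] values) {
        double avg = 0;
        long t = 1;
        for (double x : values) {
            avg += (x - avg) / t;
            ++t;
        }
        return avg;
    }
}
```

With three copies of `Double.MAX_VALUE`, `naiveMean` returns `Infinity` while `runningMean` returns `Double.MAX_VALUE`.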

Answer by David M

You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.
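A rough sketch of that idea, assuming the array length is an exact multiple of a caller-chosen `chunkSize` and that any `chunkSize` values can be summed safely. Both the name `ChunkedMean` and the `chunkSize` parameter are illustrative, not from the answer.

```java
class ChunkedMean {
    // Average of the averages of equal-sized chunks.
    // Assumes values.length is a positive multiple of chunkSize.
    static double mean(double[] values, int chunkSize) {
        if (chunkSize <= 0 || values.length == 0 || values.length % chunkSize != 0) {
            throw new IllegalArgumentException("length must be a positive multiple of chunkSize");
        }
        int chunks = values.length / chunkSize;
        double result = 0;
        for (int c = 0; c < chunks; c++) {
            double chunkSum = 0;
            for (int i = c * chunkSize; i < (c + 1) * chunkSize; i++) {
                chunkSum += values[i];
            }
            // Divide each chunk average by the chunk count before accumulating,
            // so the accumulator holds a partial mean rather than a sum of averages.
            result += (chunkSum / chunkSize) / chunks;
        }
        return result;
    }
}
```

Every extra division adds a little rounding error, so this trades some precision for range.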

Answer by Alon

Divide all values by the set size, and then sum them up.
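As a minimal sketch (the class name is invented for illustration), dividing first looks like this. It needs the count up front, and each division rounds, so very small values may lose precision or underflow.

```java
class DivideFirstMean {
    // Divide each value by the count before adding it, so the accumulator
    // ends up holding the mean instead of the full sum.
    static double mean(double[] values) {
        double n = values.length;
        double result = 0;
        for (double x : values) {
            result += x / n;
        }
        return result;
    }
}
```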

Answer by Anon.

Option 1 is to use an arbitrary-precision library so you don't have an upper-bound.

Other options (which lose precision) are to sum in groups rather than all at once, or to divide before summing.

Answer by Bozho

Apart from using the better approaches already suggested, you can use BigDecimal to make your calculations. (Bear in mind it is immutable.)
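A small sketch of the BigDecimal route, assuming all inputs are finite (`new BigDecimal(double)` rejects NaN and infinities). The class name and the choice of `MathContext.DECIMAL128` for the final division are illustrative assumptions. It will be much slower than plain double arithmetic on 10^9 values.

```java
import java.math.BigDecimal;
import java.math.MathContext;

class BigDecimalMean {
    // Sums exactly in arbitrary precision, then divides once at the end.
    static double mean(double[] values) {
        BigDecimal sum = BigDecimal.ZERO;
        for (double x : values) {
            // BigDecimal is immutable: add() returns a new object, so reassign.
            sum = sum.add(new BigDecimal(x));
        }
        // A MathContext is needed because the quotient may not terminate.
        BigDecimal mean = sum.divide(new BigDecimal(values.length), MathContext.DECIMAL128);
        return mean.doubleValue();
    }
}
```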

Answer by Davide

IMHO, the most robust way of solving your problem is

  1. sort your set
  2. split it into groups of elements whose sum wouldn't overflow; since they are sorted, this is fast and easy
  3. sum each group and divide by the group size
  4. sum the groups' results (possibly calling this same algorithm recursively); be aware that if the groups are not equally sized, you'll have to weight them by their size

One nice thing about this approach is that it scales nicely if you have a really large number of elements to sum, and a large number of processors/machines to do the math.
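The steps above might be sketched as follows. This is one possible single-threaded reading, not the author's code: a group is closed as soon as adding the next value would overflow, and group averages are folded together weighted by group size. All names are hypothetical, and the inputs are assumed finite.

```java
import java.util.Arrays;

class GroupedMean {
    static double mean(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);                // step 1: sort

        double combined = 0;                // weighted mean of all closed groups
        long count = 0;                     // number of values in closed groups
        double groupSum = 0;
        long groupCount = 0;

        for (double x : sorted) {
            // step 2: close the current group if adding x would overflow
            if (Double.isInfinite(groupSum + x)) {
                count += groupCount;
                // steps 3 and 4: group average, folded in with weight groupCount / count
                combined += (groupSum / groupCount - combined) * ((double) groupCount / count);
                groupSum = 0;
                groupCount = 0;
            }
            groupSum += x;
            groupCount++;
        }
        if (groupCount > 0) {               // fold in the final, possibly smaller, group
            count += groupCount;
            combined += (groupSum / groupCount - combined) * ((double) groupCount / count);
        }
        return combined;
    }
}
```

In a distributed setting each machine could produce its own (average, count) pairs, and the same weighted fold would combine them.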

Answer by John Knoeller

A double can be divided by a power of 2 without loss of precision (barring underflow into the subnormal range). So if your only problem is the absolute size of the sum, you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.

For instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20, because once the scales are adjusted so that the numbers can be added together, the smaller number is 0. Doubles can only hold about 17 digits, and you would need more than 40 digits to add these two numbers together without loss.
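Both points can be illustrated with a short sketch. The class name and the `shift` parameter are assumptions for illustration. `Math.scalb` multiplies by a power of two, but it was added in Java 6, so on Java 1.5 you would multiply by a precomputed power-of-two constant instead.

```java
class ScaledMean {
    // Pre-scales each value by 2^-shift before summing, then undoes the scaling.
    // shift is a hypothetical tuning parameter; for 10^9 values, 30 would be enough
    // headroom since 2^30 > 10^9.
    static double mean(double[] values, int shift) {
        double sum = 0;
        for (double x : values) {
            sum += Math.scalb(x, -shift);
        }
        return Math.scalb(sum / values.length, shift);
    }

    // True when adding small to large leaves large unchanged,
    // i.e. the small value is lost entirely.
    static boolean isAbsorbed(double large, double small) {
        return large + small == large;
    }
}
```

`ScaledMean.isAbsorbed(9.0e20, 2.2e-20)` returns true, which is the loss described above. Scaling fixes the overflow but does nothing about this absorption.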

So, depending on your data set and how many digits of precision you can afford to lose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number), then subtract the rough average from each value before you sum. That way you are summing the distances from the average, so your sum should never get very large.

Then you take the average delta and add it to your rough average to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate.
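A minimal sketch of the rough-average idea, with `roughAverage` supplied by the caller (how you obtain it is up to you; the names are illustrative). Iterating means feeding the result back in as the next rough average.

```java
class DeltaMean {
    // Sums the signed distances from a rough average instead of the raw values,
    // then corrects the rough average by the mean distance.
    static double mean(double[] values, double roughAverage) {
        double deltaSum = 0;
        for (double x : values) {
            deltaSum += x - roughAverage;
        }
        return roughAverage + deltaSum / values.length;
    }

    // One refinement pass: reuse the first result as the new rough average.
    static double refinedMean(double[] values, double roughAverage) {
        return mean(values, mean(values, roughAverage));
    }
}
```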

Answer by Lasse V. Karlsen

The very first issue I'd like to ask you is this:

  • Do you know the number of values beforehand?

If not, then you have little choice but to sum, count, and divide to compute the average. If Double doesn't have enough range or precision to handle this, then tough luck: you can't use Double, and you need to find a data type that can handle it.

If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, while keeping the overall result.

The average of N values, stored in some collection A, is this:

A[0]   A[1]   A[2]   A[3]          A[N-2]   A[N-1]
---- + ---- + ---- + ---- + .... + ------ + ------
 N      N      N      N               N       N

To calculate subsets of this result, you can split up the calculation into equally sized sets, so you can do this for 3-valued sets (assuming the number of values is divisible by 3, otherwise you need a different divisor):

/ A[0]   A[1]   A[2] \   / A[3]   A[4]   A[5] \          / A[N-3]   A[N-2]   A[N-1] \
| ---- + ---- + ---- |   | ---- + ---- + ---- |          | ------ + ------ + ------ |
\  3      3      3   /   \  3      3      3   /          \   3        3        3    /
 --------------------- +  --------------------- + .... +  --------------------------
          N                        N                                  N
         ---                      ---                                ---
          3                        3                                  3

Note that you need equally sized sets, otherwise numbers in the last set, which will not have enough values compared to all the sets before it, will have a higher impact on the final result.

Consider the numbers 1-7 in sequence. If you pick a set size of 3, you'll get this result:

/ 1   2   3 \   / 4   5   6 \   / 7 \ 
| - + - + - | + | - + - + - | + | - |
\ 3   3   3 /   \ 3   3   3 /   \ 3 /
 -----------     -----------     ---
      y               y           y

which gives:

     2   5   7/3
     - + - + ---
     y   y    y

If y is 3 for all the sets, you get this:

     2   5   7/3
     - + - + ---
     3   3    3

which gives:

2*3   5*3    7
--- + --- + ---
 9     9     9

which is:

6   15   7
- + -- + -
9    9   9

which totals:

28
-- ~ 3.1111...
 9

The average of 1-7 is 4. Obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end there), then you'll get the above result.

In other words, if you can't split the number of values up into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceding it, but it will be padded with zeroes for all the missing values.
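The worked example can be reproduced with a small sketch that deliberately treats every set as if it held `setSize` values. The class and method names are invented for illustration, and this shows the pitfall rather than a recommended method.

```java
class UnequalSetsDemo {
    // Averages set by set, dividing every set by setSize even when the
    // last set is short. This mirrors the flawed calculation above.
    static double naiveSetAverage(double[] values, int setSize) {
        int sets = (values.length + setSize - 1) / setSize; // round up
        double result = 0;
        for (int start = 0; start < values.length; start += setSize) {
            int end = Math.min(start + setSize, values.length);
            double setSum = 0;
            for (int i = start; i < end; i++) {
                setSum += values[i];
            }
            result += (setSum / setSize) / sets;
        }
        return result;
    }
}
```

For the numbers 1 through 7 with a set size of 3 this yields 28/9 instead of 4, and padding the input with two zeroes gives the same 28/9. With 1 through 6 the sets are equal and the result is the true mean, 3.5.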

So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.

What I'm worried about here, though, is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it initially cannot hold the entire sum of the values.

Answer by basszero

Answer by Carl

To avoid repeating myself, let me state up front that I am assuming the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distributions, but some things will not meet the expectations I describe below.

--

Sum up a sub-series, keeping track of how many numbers you eat, until you approach the overflow, then take the average. This will give you an average a0, and count n0. Repeat until you exhaust the list. Now you should have many ai, ni.

Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.

You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.

The ni/np should be close to one. Now sum ni/np * ai and multiply by np/(sum ni), keeping track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.

If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on.
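One way to read this procedure as code is sketched below, with a single combining pass. The `limit` threshold that ends a bite is a hypothetical parameter, checked after each addition, so it needs to leave room for one more value. The input is assumed to be non-empty, and the number of bites is assumed small enough that the weighted sum of averages does not itself overflow.

```java
import java.util.ArrayList;
import java.util.List;

class BiteMean {
    static double mean(double[] values, double limit) {
        // Each entry is {a_i, n_i}: the average of one bite and its size.
        List<double[]> bites = new ArrayList<double[]>();
        double sum = 0;
        long n = 0;
        for (double x : values) {
            sum += x;
            n++;
            if (Math.abs(sum) > limit) {      // approaching overflow: end this bite
                bites.add(new double[] {sum / n, n});
                sum = 0;
                n = 0;
            }
        }
        if (n > 0) {
            bites.add(new double[] {sum / n, n}); // last, possibly smaller, bite
        }

        double np = bites.get(0)[1];          // any n_i works as the reference size
        double weighted = 0;                  // sum of (n_i / np) * a_i
        double totalN = 0;                    // sum of n_i
        for (double[] bite : bites) {
            weighted += (bite[1] / np) * bite[0];
            totalN += bite[1];
        }
        return weighted * (np / totalN);
    }
}
```

If there are too many bites for one pass, the returned average and `totalN` would form a new (ai, ni) pair for the next level of combining.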

Answer by Kevin Day

Randomly sampling a small subset of the full dataset will often result in a 'good enough' solution. You obviously have to make this determination yourself based on system requirements. The sample size can be remarkably small and still yield reasonably good answers. It can be chosen adaptively by calculating the average of an increasing number of randomly chosen samples: the average will converge within some interval.

Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.
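A minimal sampling sketch, assuming the data fits in an array and that sampling with replacement is acceptable. The class name, the fixed `sampleSize`, and the `seed` parameter are illustrative choices. An adaptive version would keep drawing until the running average stops moving by more than some tolerance.

```java
import java.util.Random;

class SampledMean {
    // Estimates the mean from sampleSize randomly chosen elements,
    // folding each one into a running average so no large sum is built up.
    static double mean(double[] values, int sampleSize, long seed) {
        Random rng = new Random(seed);
        double avg = 0;
        for (int t = 1; t <= sampleSize; t++) {
            double x = values[rng.nextInt(values.length)]; // sampling with replacement
            avg += (x - avg) / t;
        }
        return avg;
    }
}
```

The result is an estimate, so its error depends on the sample size and on how spread out the data is.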
