C#: How do I find the average of a large set of numbers?
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/895396/
How do I find the average in a LARGE set of numbers?
Question
I have a large set of numbers, probably in the multiple-gigabyte range. The first issue is that I can't store all of them in memory. The second is that any attempt to add them up will result in an overflow. I was thinking of using more of a rolling average, but it needs to be accurate. Any ideas?
These are all floating point numbers.
This is not read from a database; it is a CSV file collected from multiple sources. It has to be accurate because the values are stored as fractions of a second (e.g., 0.293482888929), and a rolling average can be the difference between .2 and .3.
It is a set of numbers representing how long users took to respond to certain form actions. For example, when showing a message box, how long did it take them to press OK or Cancel? The data was sent to me stored as seconds with fractional parts, 1.2347 seconds for example. If I convert it to milliseconds, I overflow int, long, etc. rather quickly. Even if I don't convert it, I still overflow rather quickly. I guess the one answer below is correct: maybe I don't have to be 100% accurate, and if I look within a certain range inside a specific StdDev I would be close enough.
Accepted answer by Alex Reynolds
You can sample randomly from your set ("population") to get an average ("mean"). The accuracy will be determined by how much your samples vary (as measured by the "standard deviation" or variance).
The advantage is that you have billions of observations, and you only have to sample a fraction of them to get decent accuracy or the "confidence range" of your choice. If the conditions are right, this cuts down the amount of work you will be doing.
Here's a numerical library for C# that includes a random sequence generator. Just make a random sequence of numbers that reference indices in your array of elements (from 1 to x, the number of elements in your array). Dereference to get the values, and then calculate your mean and standard deviation.
If you want to test the distribution of your data, consider using the Chi-Squared Fit test or the K-S test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable or not.
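A minimal sketch of the sampling idea, under a few assumptions that are not in the answer: the values are streamed one at a time (for example, parsed line by line from the CSV) instead of being indexed in an array, and each value is kept independently with probability `sampleRate` (Bernoulli sampling, swapped in for the index-based sampling described above). The names `SamplingEstimator`, `Estimate`, `sampleRate`, and `seed` are illustrative only. It reports the sample mean, the sample standard deviation, and the standard error of the mean, which indicates how far the sample mean is likely to sit from the population mean.

```csharp
using System;
using System.Collections.Generic;

public static class SamplingEstimator
{
    // Returns the mean, standard deviation, standard error, and size of a random sample.
    public static (double Mean, double StdDev, double StdError, long Count) Estimate(
        IEnumerable<double> values, double sampleRate, int seed)
    {
        var rng = new Random(seed);
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the current mean

        foreach (double x in values)
        {
            // Assumption: keep each value independently with probability sampleRate.
            if (rng.NextDouble() >= sampleRate) continue;

            // Incremental (Welford-style) update, so no large sum is ever stored.
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        if (n < 2)
            throw new InvalidOperationException("Sample too small; raise sampleRate.");

        double stdDev = Math.Sqrt(m2 / (n - 1));
        double stdError = stdDev / Math.Sqrt(n);
        return (mean, stdDev, stdError, n);
    }
}
```

With billions of rows, a `sampleRate` such as 0.001 would still leave millions of observations; whether that is accurate enough depends on the standard error you are willing to accept.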
Answer by S.Lott
Integers or floats?
If they're integers, you need to accumulate a frequency distribution by reading the numbers and recording how many of each value you see. That can be averaged easily.
For floating point, this is a bit of a problem. Given the overall range of the floats, and the actual distribution, you have to work out a bin size that preserves the accuracy you want without preserving all of the numbers.
Edit
First, you need to sample your data to get a mean and a standard deviation. A few thousand points should be good enough.
Then, you need to determine a respectable range. Folks pick things like ±6σ (standard deviations) around the mean. You'll divide this range into as many buckets as you can stand.
In effect, the number of buckets determines the number of significant digits in your average. So, pick 10,000 or 100,000 buckets to get 4 or 5 digits of precision. Since it's a measurement, odds are good that your measurements only have two or three significant digits.
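The bucket approach above can be sketched roughly as follows. This assumes a pilot mean and standard deviation have already been estimated from a small sample, and that values falling outside the ±6σ range are clamped into the end buckets (the answer does not say how to treat them, so that part is a guess). `BucketMean`, `Estimate`, `pilotMean`, `pilotStdDev`, and `bucketCount` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class BucketMean
{
    // Approximates the mean by counting values in equal-width buckets over
    // [pilotMean - 6 * pilotStdDev, pilotMean + 6 * pilotStdDev].
    public static double Estimate(
        IEnumerable<double> values, double pilotMean, double pilotStdDev, int bucketCount)
    {
        if (pilotStdDev <= 0.0) throw new ArgumentOutOfRangeException(nameof(pilotStdDev));
        if (bucketCount <= 0) throw new ArgumentOutOfRangeException(nameof(bucketCount));

        double lo = pilotMean - 6.0 * pilotStdDev;
        double hi = pilotMean + 6.0 * pilotStdDev;
        double width = (hi - lo) / bucketCount;
        var counts = new long[bucketCount];
        long total = 0;

        foreach (double x in values)
        {
            // Assumption: out-of-range values are clamped into the first or last bucket.
            double pos = Math.Floor((x - lo) / width);
            int i = (int)Math.Max(0.0, Math.Min((double)(bucketCount - 1), pos));
            counts[i]++;
            total++;
        }

        if (total == 0) throw new InvalidOperationException("No values.");

        // Weight each bucket midpoint by its share of the observations.
        double mean = 0.0;
        for (int i = 0; i < bucketCount; i++)
        {
            double midpoint = lo + (i + 0.5) * width;
            mean += midpoint * ((double)counts[i] / total);
        }
        return mean;
    }
}
```

Only the `long` counters grow with the data, so nothing overflows in practice; the price is that each value is rounded to its bucket midpoint, which bounds the error of the result by half a bucket width as long as no value had to be clamped.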
Edit
What you'll discover is that the mean of your initial sample is very close to the mean of any other sample. And any sample mean is close to the population mean. You'll note that most (but not all) of your means are within 1 standard deviation of each other.
You should find that your measurement errors and inaccuracies are larger than your standard deviation.
This means that a sample mean is as useful as a population mean.
Answer by KM.
Depending on the range of the numbers, it might be a good idea to have an array where the subscript is your number and the value is the count of that number. You could then do your calculation from this array.
Answer by Bill K
Wouldn't a rolling average be as accurate as anything else (discounting rounding errors, I mean)? It might be kind of slow because of all the dividing.
You could group batches of numbers and average them recursively: average 100 numbers 100 times, then average the results. This would be less thrashing and mostly addition.
In fact, if you added 256 or 512 numbers at once, you might be able to bit-shift the result by 8 or 9 respectively (I believe you could do this in a double by simply adjusting the floating-point exponent). This would make your program extremely quick, and it could be written recursively in just a few lines of code (not counting the unsafe operation of the exponent adjustment).
Perhaps dividing by 256 already uses this optimization? I may have to speed-test dividing by 255 vs. 256 and see if there is some massive improvement. I'm guessing not.
Answer by Jay
If the numbers are ints, accumulate the total in a long. If the numbers are longs ... what language are you using? In Java you could accumulate the total in a BigInteger, which is an integer that will grow as large as it needs to be. You could always write your own class to reproduce this functionality. The gist of it is just to make an array of integers to hold each "big number". When you add two numbers, loop through starting with the low-order value. If the result of the addition sets the high-order bit, clear this bit and carry the one to the next column.
Another option would be to find the average of, say, 1000 numbers at a time. Hold these intermediate results, then when you're done average them all together.
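In C#, the same idea could be sketched with `System.Numerics.BigInteger`, assuming the measurements have first been converted to whole units (say, microseconds) stored as `long`. That conversion, and the names `BigIntegerMean` and `Mean`, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class BigIntegerMean
{
    // Sums 64-bit integers in a BigInteger, so the running total cannot overflow.
    public static double Mean(IEnumerable<long> values)
    {
        BigInteger sum = BigInteger.Zero;
        long count = 0;

        foreach (long v in values)
        {
            sum += v; // implicit long -> BigInteger conversion
            count++;
        }

        if (count == 0) throw new InvalidOperationException("No values.");

        // Split into quotient and remainder so the fractional part is kept
        // when the result is converted to double.
        BigInteger quotient = BigInteger.DivRem(sum, count, out BigInteger remainder);
        return (double)quotient + (double)remainder / count;
    }
}
```

The sum itself is exact; only the final conversion to `double` rounds.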
Answer by tom10
You could break the data into sets of, say, 1000 numbers, average these, and then average the averages.
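One way this might look in C#, with two liberties taken: each chunk mean is weighted by its size, because the last chunk is usually shorter than the rest and a plain average of averages would then be slightly off, and chunk means are merged into the overall mean as they are produced instead of being stored. `ChunkedMean`, `Mean`, `Fold`, and `chunkSize` are illustrative names.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkedMean
{
    // Averages the input in fixed-size chunks, then combines the chunk means.
    public static double Mean(IEnumerable<double> values, int chunkSize = 1000)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        double overallMean = 0.0;
        long totalCount = 0;
        double chunkSum = 0.0;
        int chunkCount = 0;

        foreach (double x in values)
        {
            chunkSum += x;
            chunkCount++;
            if (chunkCount == chunkSize)
            {
                Fold(ref overallMean, ref totalCount, chunkSum / chunkCount, chunkCount);
                chunkSum = 0.0;
                chunkCount = 0;
            }
        }

        // The final chunk may be shorter than chunkSize.
        if (chunkCount > 0)
            Fold(ref overallMean, ref totalCount, chunkSum / chunkCount, chunkCount);

        if (totalCount == 0) throw new InvalidOperationException("No values.");
        return overallMean;
    }

    // Merges one chunk mean into the overall mean, weighted by the chunk's size.
    private static void Fold(
        ref double overallMean, ref long totalCount, double chunkMean, int chunkCount)
    {
        totalCount += chunkCount;
        overallMean += (chunkMean - overallMean) * chunkCount / totalCount;
    }
}
```

No sum ever covers more than `chunkSize` values, so the accumulator stays close to the magnitude of the data itself.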
Answer by Craig Gidney
Why is a sum of floating point numbers overflowing? In order for that to happen, you would need to have values near the max float value, which sounds odd.
If you were dealing with integers, I'd suggest using a BigInteger, or breaking the set into multiple subsets, recursively averaging the subsets, then averaging the averages.
If you're dealing with floats, it gets a bit weird. A rolling average could become very inaccurate. I suggest using a rolling average which is only updated when you hit an overflow exception or the end of the set, effectively dividing the set into non-overflowing subsets.
Answer by Michael Borgwardt
Two ideas from me:
- If the numbers are ints, use an arbitrary-precision library like IntX; this could be too slow, though.
- If the numbers are floats and you know how many there are in total, you can divide each entry by that count and add up the results. If you use double, the precision should be sufficient.
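The second idea could be sketched like this, assuming the number of entries is known before the pass starts (for instance from an earlier line count). `PreScaledMean`, `Mean`, and `totalCount` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class PreScaledMean
{
    // Divides every entry by the known count before adding, so the
    // accumulator never grows beyond the magnitude of the mean itself.
    public static double Mean(IEnumerable<double> values, long totalCount)
    {
        if (totalCount <= 0) throw new ArgumentOutOfRangeException(nameof(totalCount));

        double mean = 0.0;
        foreach (double x in values)
        {
            mean += x / totalCount;
        }
        return mean;
    }
}
```

The cost is one division per entry, and each division rounds, so the result can differ from a plain sum-then-divide in the last few digits.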
Answer by lostlogic
Here's one way to do it in pseudocode:
    average = first
    count = 1
    while more:
        count += 1
        diff = next - average
        average += diff / count
    return average
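A C# rendering of this pseudocode might look like the sketch below. `RunningMean` and `Mean` are illustrative names, and taking an `IEnumerable<double>` is an assumption so that values can be streamed without loading the file into memory.

```csharp
using System;
using System.Collections.Generic;

public static class RunningMean
{
    // Updates the mean incrementally; no running total is ever stored,
    // so there is nothing that can overflow.
    public static double Mean(IEnumerable<double> values)
    {
        double average = 0.0;
        long count = 0;

        foreach (double next in values)
        {
            count++;
            // Move the average toward the new value by 1/count of the gap.
            // For the first value this sets average = next.
            average += (next - average) / count;
        }

        if (count == 0) throw new InvalidOperationException("No values.");
        return average;
    }
}
```

If the CSV held one value per line (an assumption about the file layout), the input could be streamed with something like `File.ReadLines(path).Select(line => double.Parse(line, CultureInfo.InvariantCulture))`, which needs `System.IO`, `System.Linq`, and `System.Globalization`.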
Answer by Frank Krueger
You mean you overflow 32-bit and 64-bit numbers. But why not just use a proper Rational Big Num library? If you have so much data and you want an exact mean, then just code it.
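using System;
using System.IO;

// Bignum stands in for an arbitrary-precision integer type (for example,
// System.Numerics.BigInteger); it is not defined in this sketch.
class RationalBignum {
    public Bignum Numerator { get; set; }
    public Bignum Denominator { get; set; }
    // Assumed but not shown: constructors taking an int or a float, and
    // the + and / operators used below.
}
class BigMeanr {
    public static int Main(string[] argv) {
        var sum = new RationalBignum(0);
        var n = new Bignum(0);
        using (var s = new FileStream(argv[0], FileMode.Open))
        using (var r = new BinaryReader(s)) {
            try {
                while (true) {
                    // Throws EndOfStreamException once the file is exhausted.
                    var flt = r.ReadSingle();
                    var rat = new RationalBignum(flt);
                    sum += rat;
                    n++;
                }
            }
            catch (EndOfStreamException) {
                // End of input: fall through and report the mean.
            }
        }
        Console.WriteLine("The mean is: {0}", sum / n);
        return 0;
    }
}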
Just remember, there are more numeric types out there than the ones your compiler offers you.