Scala - Finding mean and standard deviation of a large dataset
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/24192265/
Finding mean and standard deviation of a large dataset
Asked by Vedant
I have about 1500 files on S3; each file looks like this:
Format:
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val fileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?
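For data that fits in memory, the whole computation can be sketched in plain Scala; the input lines below reuse the example above, and the sample (n - 1) standard deviation is used:

```scala
// Plain-Scala sketch: parse the example lines above and compute the
// mean and sample standard deviation of every Score.
val lines = Seq(
  "1152\t401368:1.006,401207:1.03",
  "1184\t401230:1.119,40049:1.11,40029:1.31"
)

// Drop the UserId, split the comma-separated pairs, keep the part after ':'
val scores: Seq[Double] =
  lines.flatMap(_.split("\t").last.split(",").map(_.split(":").last.toDouble))

val count  = scores.length
val mean   = scores.sum / count
val stddev = math.sqrt(scores.map(s => math.pow(s - mean, 2)).sum / (count - 1))
```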
Answered by Daniel Darabos
While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
Pick out the "score" somehow:
// Each line holds several ItemId:Score pairs; take every Score
val scores: RDD[Double] = lines.flatMap(_.split("\t").last.split(",").map(_.split(":").last.toDouble)).cache
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
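The two passes above (one for the mean, one for the deviations) can also be fused into a single streaming pass with Welford's algorithm; this plain-Scala sketch is an addition, not part of the original answer:

```scala
// One-pass mean/variance via Welford's algorithm. Works on any
// iterator, so the data is only traversed once.
def welford(xs: Iterator[Double]): (Long, Double, Double) = {
  var n = 0L
  var mean = 0.0
  var m2 = 0.0 // running sum of squared deviations from the mean
  for (x <- xs) {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
  }
  val sampleStddev = if (n > 1) math.sqrt(m2 / (n - 1)) else 0.0
  (n, mean, sampleStddev)
}

val (n, mean, stddev) = welford(Iterator(1.006, 1.03, 1.119, 1.11, 1.31))
```

Spark offers the same one-pass statistics out of the box: calling .stats on an RDD[Double] returns a StatCounter with count, mean, and sampleStdev.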
Answered by M.Rez
This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions) since Spark version >= 1.6.0.
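A hedged sketch of what this might look like, assuming the scores have already been parsed into a DataFrame with a Double column named score (the column and variable names are illustrative, not from the answer):

```scala
import org.apache.spark.sql.functions.{avg, stddev_samp, stddev_pop}

// scoresDF is assumed to be a DataFrame with a Double column "score".
val result = scoresDF.agg(
  avg("score"),          // mean
  stddev_samp("score"),  // sample standard deviation (divides by n - 1)
  stddev_pop("score")    // population standard deviation (divides by n)
)
result.show()
```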
Answered by Bob Kuhar
I use Apache Commons Math for this stuff (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream values through the SummaryStatistics class, so you aren't limited by memory size. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to work your way through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
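A minimal sketch of that streaming approach, assuming the commons-math3 artifact is on the classpath (untested, in the same spirit as the answer):

```scala
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import scala.io.Source

val stats = new SummaryStatistics  // keeps running aggregates, not the data

for {
  line <- Source.fromFile("/home/home/testdataFile1").getLines()
  pair <- line.split("\t").last.split(",")  // "ItemId:Score" chunks
} stats.addValue(pair.split(":").last.toDouble)

// getStandardDeviation returns the sample (n - 1) standard deviation
println(s"mean=${stats.getMean} stddev=${stats.getStandardDeviation}")
```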
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
Answered by user3478306
I don't think that storage space should be an issue, so I would try putting all of the values into an array of doubles, adding them up, and using that sum and the number of elements in the array to calculate the mean. Then sum the squared differences between each value and the mean, divide by the number of elements, and take the square root.

