Scala - Finding mean and standard deviation of a large dataset
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/24192265/
Finding mean and standard deviation of a large dataset
Asked by Vedant
I have about 1500 files on S3; each file looks like this:
Format:
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val fileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?
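For data that fits in memory, the whole computation can be sketched in plain Scala; the input lines below reuse the example above, and the sample (n - 1) standard deviation is used:

```scala
// Plain-Scala sketch: parse the example lines above and compute the
// mean and sample standard deviation of every Score.
val lines = Seq(
  "1152\t401368:1.006,401207:1.03",
  "1184\t401230:1.119,40049:1.11,40029:1.31"
)

// Drop the UserId, split the comma-separated pairs, keep the part after ':'
val scores: Seq[Double] =
  lines.flatMap(_.split("\t").last.split(",").map(_.split(":").last.toDouble))

val count  = scores.length
val mean   = scores.sum / count
val stddev = math.sqrt(scores.map(s => math.pow(s - mean, 2)).sum / (count - 1))
```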
Answered by Daniel Darabos
While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
Pick out the "score" somehow:
// Each line holds several ItemId:Score pairs; take every Score
val scores: RDD[Double] = lines.flatMap(_.split("\t").last.split(",").map(_.split(":").last.toDouble)).cache
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
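The two passes above (one for the mean, one for the deviations) can also be fused into a single streaming pass with Welford's algorithm; this plain-Scala sketch is an addition, not part of the original answer:

```scala
// One-pass mean/variance via Welford's algorithm. Works on any
// iterator, so the data is only traversed once.
def welford(xs: Iterator[Double]): (Long, Double, Double) = {
  var n = 0L
  var mean = 0.0
  var m2 = 0.0 // running sum of squared deviations from the mean
  for (x <- xs) {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
  }
  val sampleStddev = if (n > 1) math.sqrt(m2 / (n - 1)) else 0.0
  (n, mean, sampleStddev)
}

val (n, mean, stddev) = welford(Iterator(1.006, 1.03, 1.119, 1.11, 1.31))
```

Spark offers the same one-pass statistics out of the box: calling .stats on an RDD[Double] returns a StatCounter with count, mean, and sampleStdev.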
Answered by M.Rez
This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions) since Spark version >= 1.6.0.
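A hedged sketch of what this might look like, assuming the scores have already been parsed into a DataFrame with a Double column named score (the column and variable names are illustrative, not from the answer):

```scala
import org.apache.spark.sql.functions.{avg, stddev_samp, stddev_pop}

// scoresDF is assumed to be a DataFrame with a Double column "score".
val result = scoresDF.agg(
  avg("score"),          // mean
  stddev_samp("score"),  // sample standard deviation (divides by n - 1)
  stddev_pop("score")    // population standard deviation (divides by n)
)
result.show()
```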
Answered by Bob Kuhar
I use Apache Commons Math for this stuff (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream values through the SummaryStatistics class, so you aren't limited by memory size. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to work your way through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
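A minimal sketch of that streaming approach, assuming the commons-math3 artifact is on the classpath (untested, in the same spirit as the answer):

```scala
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import scala.io.Source

val stats = new SummaryStatistics  // keeps running aggregates, not the data

for {
  line <- Source.fromFile("/home/home/testdataFile1").getLines()
  pair <- line.split("\t").last.split(",")  // "ItemId:Score" chunks
} stats.addValue(pair.split(":").last.toDouble)

// getStandardDeviation returns the sample (n - 1) standard deviation
println(s"mean=${stats.getMean} stddev=${stats.getStandardDeviation}")
```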
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
Answered by user3478306
I don't think that storage space should be an issue, so I would try putting all of the values into an array of doubles, adding them up, and using that sum and the number of elements in the array to calculate the mean. Then sum the squared differences between each value and the mean, divide by the number of elements, and take the square root.

