Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me) at StackOverflow: http://stackoverflow.com/questions/24694303/

How to compute the mean with Apache Spark?

Tags: java, scala, apache-spark, apache-spark-mllib

Asked by merours

I have a list of Double values, stored like this:

JavaRDD<Double> myDoubles

I would like to compute the mean of this list. According to the documentation:

All of MLlib's methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.

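As a minimal sketch of that conversion (the enclosing class name is illustrative; myDoubles is the JavaRDD<Double> from the question), calling .rdd() simply unwraps the Scala RDD that backs the Java wrapper:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;

class ConversionSketch {
    // .rdd() exposes the underlying Scala RDD behind a JavaRDD.
    static RDD<Double> toScalaRdd(JavaRDD<Double> myDoubles) {
        return myDoubles.rdd();
    }
}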

On the same page, I see the following code:

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

From my understanding, this is equivalent (in terms of types) to

Double MSE = RDD<Double>.mean()

As a consequence, I tried to compute the mean of my JavaRDD like this:

myDoubles.rdd().mean()

However, it doesn't work and gives me the following error: The method mean() is undefined for the type RDD<Double>. I also didn't find any mention of this function in the RDD Scala documentation. Is this a misunderstanding on my side, or is it something else?

Accepted answer by merours

It's actually quite simple: mean() is defined for the JavaDoubleRDD class. I didn't find how to cast from JavaRDD<Double> to JavaDoubleRDD, but in my case, it was not necessary.

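For context: in Scala, mean() is grafted onto RDD[Double] by an implicit conversion (DoubleRDDFunctions), which Java cannot see; from Java, the entry point is JavaDoubleRDD. Here is a minimal sketch of one way to get a JavaDoubleRDD out of a JavaRDD<Double> using mapToDouble (the class and method names wrapping it are illustrative):

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;

class MeanSketch {
    // mapToDouble returns a JavaDoubleRDD, on which mean() is defined.
    static double meanOf(JavaRDD<Double> myDoubles) {
        JavaDoubleRDD asDoubles = myDoubles.mapToDouble(Double::doubleValue);
        return asDoubles.mean();
    }
}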

Indeed, this line in Scala

val mean = valuesAndPreds.map{case(v, p) => (v - p)}.mean()

can be expressed in Java as

double mean = valuesAndPreds.mapToDouble(tuple -> tuple._1() - tuple._2()).mean();
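
Putting it together, here is a self-contained sketch one could run locally (the sample data, app name, and local[*] master are illustrative; valuesAndPreds is assumed to hold (value, prediction) pairs as Tuple2<Double, Double>):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class MeanOfDifferences {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MeanOfDifferences").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Tuple2<Double, Double>> valuesAndPreds = sc.parallelize(
                Arrays.asList(new Tuple2<>(3.0, 2.5), new Tuple2<>(1.0, 1.5)));
        // mapToDouble yields a JavaDoubleRDD, which defines mean().
        double mean = valuesAndPreds.mapToDouble(t -> t._1() - t._2()).mean();
        System.out.println("mean = " + mean); // ((3.0 - 2.5) + (1.0 - 1.5)) / 2 = 0.0
        sc.stop();
    }
}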

Answered by Viliam Simko

Don't forget to add import org.apache.spark.SparkContext._ at the top of your Scala file; that import brings the implicit conversion that adds mean() to RDD[Double] into scope. Also make sure you are calling mean() on an RDD[Double].
