Scala: How can I calculate the exact median with Apache Spark?
Original source: http://stackoverflow.com/questions/28158729/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How can I calculate exact median with Apache Spark?
Asked by pckmn
Answered by Eugene Zhulenev
You need to sort the RDD and take the middle element, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:
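import org.apache.spark.SparkContext._ // pair-RDD implicits such as lookup (required on older Spark versions)
import org.apache.spark.rdd.RDD

val rdd: RDD[Int] = ???
// Sort, then key each value by its position so it can be looked up by index.
val sorted = rdd.sortBy(identity).zipWithIndex().map {
  case (v, idx) => (idx, v)
}
val count = sorted.count()
val median: Double = if (count % 2 == 0) {
  // Even count: average the two middle elements.
  val l = count / 2 - 1
  val r = l + 1
  // Convert to Double before adding to avoid Int overflow.
  (sorted.lookup(l).head.toDouble + sorted.lookup(r).head) / 2
} else sorted.lookup(count / 2).head.toDouble // odd count: the middle element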
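To sanity-check the index arithmetic without a cluster, the same middle-element rule can be sketched on a local collection. This is only an illustration: `localMedian` is a hypothetical helper name, not part of Spark, and it assumes a non-empty input.

```scala
// Hypothetical helper (not part of Spark): the same rule as the RDD version,
// applied to an in-memory sequence.
def localMedian(xs: Seq[Int]): Double = {
  require(xs.nonEmpty, "median of an empty collection is undefined")
  val sorted = xs.sorted.toIndexedSeq
  val n = sorted.length
  if (n % 2 == 0) {
    // Even count: average the two middle elements.
    val l = n / 2 - 1
    val r = l + 1
    (sorted(l).toDouble + sorted(r)) / 2
  } else sorted(n / 2).toDouble // odd count: the middle element
}

println(localMedian(Seq(5, 1, 3)))    // 3.0
println(localMedian(Seq(4, 1, 3, 2))) // 2.5
```

The even case picks zero-based indices n / 2 - 1 and n / 2, which matches the `l` and `r` values used with `lookup` in the RDD version.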
Answered by Shaido - Reinstate Monica
Using Spark 2.0+ and the DataFrame API, you can use the approxQuantile method:
def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)
Since Spark 2.2 it also works on multiple columns at the same time. Setting probabilities to Array(0.5) and relativeError to 0 computes the exact median. From the documentation:
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.
Despite this, there seem to be some precision issues when relativeError is set to 0; see the question here. In some instances a small error close to 0 works better (this depends on the Spark version).
A small working example which calculates the median of the numbers from 1 to 99 (both inclusive) and uses a low relativeError:
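import spark.implicits._ // for toDF; assumes a SparkSession named spark, as in spark-shell
val df = (1 to 99).toDF("num")
// The median is the 0.5 quantile; a small relativeError is used instead of 0.
val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0)
println(median)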
The median returned is 50.0.
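The multi-column form mentioned in this answer (Spark 2.2+) might look like the sketch below. It assumes a local SparkSession and uses made-up column names "a" and "b". The overload takes an array of column names and returns one inner array per column, with one entry per requested probability.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("median-sketch")
  .getOrCreate()
import spark.implicits._

// Two numeric columns; the names "a" and "b" are made up for this example.
val df = (1 to 99).map(i => (i, i * 2)).toDF("a", "b")

// Multi-column overload: the result is indexed by column first, then by probability.
val medians: Array[Array[Double]] =
  df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.001)

Seq("a", "b").zip(medians).foreach { case (name, qs) =>
  println(s"$name: ${qs(0)}")
}

spark.stop()
```

Requesting all columns in one call keeps the quantile computation in a single job instead of calling approxQuantile once per column.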

