Scala: How to calculate the exact median with Apache Spark?

Question

Asked by pckmn

This page contains some statistics functions (mean, stdev, variance, etc.), but it does not contain the median. How can I calculate the exact median?

Answer 1

Answered by Eugene Zhulenev

You need to sort the RDD and take the element in the middle, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:

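  import org.apache.spark.SparkContext._  // pair-RDD implicits such as lookup (not required on newer Spark versions)
  import org.apache.spark.rdd.RDD

  val rdd: RDD[Int] = ???

  // Sort, then key each value by its position so it can be looked up by rank.
  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  // Even count: average the two middle elements; odd count: take the middle one.
  val median: Double = if (count % 2 == 0) {
    val l = count / 2 - 1
    val r = l + 1
    // Convert to Double before adding so large Int values cannot overflow.
    (sorted.lookup(l).head.toDouble + sorted.lookup(r).head) / 2
  } else sorted.lookup(count / 2).head.toDouble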
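The even/odd index arithmetic is easy to get wrong, so it can help to check it without a cluster first. The following is a minimal sketch in plain Scala, assuming the data fits in memory; `localMedian` is a hypothetical helper name, not part of Spark.

```scala
// Hypothetical helper (not part of Spark): the same index arithmetic as the
// RDD version, applied to a local, in-memory collection.
def localMedian(values: Seq[Int]): Double = {
  require(values.nonEmpty, "median of an empty collection is undefined")
  val sorted = values.sorted
  val count = sorted.length
  if (count % 2 == 0) {
    // Even count: average the two middle elements.
    val l = count / 2 - 1
    val r = l + 1
    (sorted(l).toDouble + sorted(r)) / 2
  } else {
    // Odd count: the single middle element.
    sorted(count / 2).toDouble
  }
}
```

For example, `localMedian(Seq(3, 1, 2))` gives 2.0 and `localMedian(Seq(4, 1, 3, 2))` gives 2.5.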

Answer 2

Answered by Shaido - Reinstate Monica

Using Spark 2.0+ and the DataFrame API, you can use the approxQuantile method:

def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)

Since Spark version 2.2, it also works on multiple columns at the same time. Setting probabilities to Array(0.5) and relativeError to 0 computes the exact median. From the documentation:

The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.

Despite this, there seem to be some issues with the precision when setting relativeError to 0; see the question here. A low error close to 0 will in some instances work better (depending on the Spark version).

A small working example that calculates the median of the numbers from 1 to 99 (inclusive) and uses a low relativeError:

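// toDF needs spark.implicits._ in scope (spark-shell imports it automatically).
val df = (1 to 99).toDF("num")
// approxQuantile returns one value per requested probability; take the first.
val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0)
println(median)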

The median returned is 50.0.
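The multi-column overload mentioned above (Spark 2.2+) could be wrapped roughly as follows. This is only a sketch: `columnMedians` is a hypothetical helper name, and the default relativeError of 0.0 requests exact quantiles, which the documentation warns can be very expensive.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of Spark): the median of each listed numeric
// column, computed with the multi-column approxQuantile overload.
def columnMedians(
    df: DataFrame,
    cols: Seq[String],
    relativeError: Double = 0.0): Map[String, Double] = {
  // The result holds one inner array per column, with one entry per
  // requested probability (here only 0.5).
  val quantiles: Array[Array[Double]] =
    df.stat.approxQuantile(cols.toArray, Array(0.5), relativeError)
  cols.zip(quantiles.map(_.head)).toMap
}
```

Given a DataFrame `df` with numeric columns "num" and "doubled" (assumed names), `columnMedians(df, Seq("num", "doubled"))` returns a map from column name to median. If relativeError 0 misbehaves on your Spark version, passing a small value such as 0.001 follows the advice above.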
