scala - How to find max value in pair RDD?

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/26886275/


How to find max value in pair RDD?

scala, apache-spark, pyspark

Asked by Vijay Innamuri

I have a Spark pair RDD (key, count) as below:

Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))

How do I find the key with the highest count using the Spark Scala API?

EDIT: the datatype of the pair RDD is org.apache.spark.rdd.RDD[(String, Int)]

Answered by Sergii Lagutin

Use the Array.maxBy method:

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)
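
Note that maxBy here runs on a local Array on the driver rather than on the RDD. A minimal sketch of bridging the two, assuming rdd is the RDD[(String, Int)] from the question (collect() pulls the whole dataset to the driver, so this only suits small data):

val collected = rdd.collect()       // Array[(String, Int)] on the driver
val maxPair = collected.maxBy(_._2) // compare pairs by their count
// maxPair: (String, Int) = (d,3)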

or RDD.max with a custom Ordering:

val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  // Compare pairs by their count (the second element) only.
  override def compare(x: (String, Int), y: (String, Int)): Int =
    Ordering[Int].compare(x._2, y._2)
})
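
A more concise equivalent builds the Ordering directly from the count field with Ordering.by; a minimal sketch assuming the same rdd (maxKey3 is an illustrative name):

val maxKey3 = rdd.max()(Ordering.by[(String, Int), Int](_._2))
// maxKey3: (String, Int) = (d,3)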

Answered by Jacek Laskowski

Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))
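
The same pattern generalizes to a top-N query; a sketch against the same rdd (top2 is an illustrative name):

val top2 = rdd.takeOrdered(2)(Ordering[Int].reverse.on(_._2))
// top2: Array[(String, Int)] = Array((d,3), (b,2))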

Answered by Mayank

For PySpark:

Let a be the pair RDD with String keys and integer values; then

a.max(lambda x: x[1])

returns the key-value pair with the maximum value. Basically, the max function orders the elements by the return value of the lambda function.

Here a is a pair RDD with elements such as ('key', int), and x[1] just refers to the integer part of each element.

Note that the max function without a key compares the tuples themselves (ordering by key first), so it would return the pair with the largest key rather than the largest count.

Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max

Answered by Rubber Duck

Spark RDDs are more time-efficient when they are kept as RDDs rather than collected into arrays:

// reduce keeps the computation distributed: pairwise, keep the pair with the larger count.
stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
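
For completeness, a minimal self-contained sketch of this approach, assuming a SparkContext sc and the data from the question (the variable names are illustrative):

val stringIntTupleRDD = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 1), ("d", 3)))
val maxPair = stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
// maxPair: (String, Int) = (d,3)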