Scala: how to find the max value in a pair RDD?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/26886275/
How to find max value in pair RDD?
Asked by Vijay Innamuri
I have a Spark pair RDD (key, count) as below:
Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))
How do I find the key with the highest count using the Spark Scala API?
EDIT: the data type of the pair RDD is org.apache.spark.rdd.RDD[(String, Int)]
Answered by Sergii Lagutin
Use the Array.maxBy method:
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)
or RDD.max:
val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int =
    Ordering[Int].compare(x._2, y._2)
})
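For completeness, the same RDD.max call can be written with Ordering.by instead of an anonymous Ordering. A minimal, self-contained sketch, assuming a SparkContext named sc is available (e.g. in spark-shell):

val rdd = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 1), ("d", 3)))
// Ordering.by builds the same ordering as the anonymous class above,
// comparing pairs by their second element (the count).
val maxByCount = rdd.max()(Ordering.by[(String, Int), Int](_._2))
// maxByCount: (String, Int) = (d,3)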
Answered by Jacek Laskowski
Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))
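Relatedly (not part of the original answer), RDD.top applies the ordering in descending order itself, so the explicit reverse can be dropped. A sketch over the same rdd as above:

// top(1) returns the largest element under the given ordering.
val maxKeyTop = rdd.top(1)(Ordering.by[(String, Int), Int](_._2))
// maxKeyTop: Array[(String, Int)] = Array((d,3))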
Answered by Mayank
For PySpark:
Let a be the pair RDD with string keys and integer values; then
a.max(lambda x: x[1])
returns the key-value pair with the maximum value. Basically, the max function orders by the return value of the lambda function.
Here a is a pair RDD with elements such as ('key', int), and x[1] just refers to the integer part of the element.
Note that the max function by itself will order by key and return the max value.
Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max
Answered by Rubber Duck
Spark RDDs are more time-efficient when they are left as RDDs rather than collected into arrays:
stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
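For a runnable version, the RDD in this snippet can be built from the question's data; a sketch assuming a SparkContext named sc, with stringIntTupleRDD being just an illustrative name:

val stringIntTupleRDD = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 1), ("d", 3)))
// reduce keeps whichever pair has the larger count; only the single
// winning pair is returned to the driver.
val maxPair = stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
// maxPair: (String, Int) = (d,3)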

