Scala: how to sort an RDD
Original URL: http://stackoverflow.com/questions/33774830/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to sort RDD
Asked by Sandip Armal Patil
I have scoreTriplets, an RDD[Array[String]], which I am sorting in the following way.
val ScoreTripletsArray = scoreTriplets.collect()
if (ScoreTripletsArray.size > 0) {
  /* Sort the ScoreTripletsArray descending by score field */
  scala.util.Sorting.stableSort(ScoreTripletsArray, (e1: Array[String], e2: Array[String]) => e1(3).toInt > e2(3).toInt)
}
But collect() will be heavy if there are lakhs (hundreds of thousands) of elements.
So I need to sort the RDD by score without using collect().
scoreTriplets is an RDD[Array[String]]; each row of the RDD stores an array of the following fields:
EdgeId sourceID destID score sourceNAme destNAme distance
Please give me any reference or hint.
Answered by zero323
Because of shuffling, sorting is an expensive operation even without collecting, but you can use the sortBy method:
import scala.util.Random
val data = Seq.fill(10)(Array.fill(3)("") :+ Random.nextInt.toString)
val rdd = sc.parallelize(data)
val sorted = rdd.sortBy(_.apply(3).toInt)
sorted.take(3)
// Array[Array[String]] = Array(
// Array("", "", "", -1660860558),
// Array("", "", "", -1643214719),
// Array("", "", "", -1206834289))
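The question asks for a descending sort by score, while the snippet above sorts ascending. A minimal sketch of the descending variant follows, assuming Spark is on the classpath and a local master is acceptable for experimentation. The object name SortByScore, the helper sortByScoreDesc, and the sample rows are hypothetical and not part of the original answer; the only Spark call relied on is sortBy with its ascending parameter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByScore {
  // Hypothetical helper: sorts rows by the score field (index 3), highest first.
  // The sort is distributed and the result is still an RDD, so nothing is collected here.
  def sortByScoreDesc(rows: RDD[Array[String]]): RDD[Array[String]] =
    rows.sortBy(_.apply(3).toInt, ascending = false)

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick trial; in spark-shell, reuse the provided `sc`.
    val conf = new SparkConf().setAppName("sort-by-score").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up rows: EdgeId, sourceID, destID, score, sourceNAme, destNAme, distance
      val scoreTriplets = sc.parallelize(Seq(
        Array("e1", "1", "2", "40", "a", "b", "3"),
        Array("e2", "2", "3", "75", "b", "c", "1"),
        Array("e3", "3", "1", "10", "c", "a", "2")
      ))

      // Only take(2) brings rows back to the driver.
      val topTwoIds = sortByScoreDesc(scoreTriplets).take(2).map(_.apply(0))
      println(topTwoIds.mkString(", ")) // prints "e2, e1"
    } finally {
      sc.stop()
    }
  }
}
```

The sorted RDD can then be saved or processed further without ever materializing the whole dataset on the driver.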
If you're interested only in the top results, then top and takeOrdered are usually preferred.
import scala.math.Ordering
rdd.takeOrdered(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] =
// Array(Array("", "", "", -1660860558), Array("", "", "", -1643214719))
rdd.top(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] =
// Array(Array("", "", "", 1920955686), Array("", "", "", 1597012602))

