scala: How to sort an RDD

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33774830/

Date: 2020-10-22 07:48:23  Source: igfitidea

How to sort RDD

Tags: scala, sorting, apache-spark, rdd

Asked by Sandip Armal Patil

I have scoreTriplets, which is an RDD[Array[String]], and I am sorting it in the following way.

// collect() pulls every element of the RDD into driver memory
val ScoreTripletsArray = scoreTriplets.collect()
if (ScoreTripletsArray.size > 0) {
  // Sort ScoreTripletsArray in place, descending by the score field (index 3)
  scala.util.Sorting.stableSort(ScoreTripletsArray, (e1: Array[String], e2: Array[String]) => e1(3).toInt > e2(3).toInt)
}

But collect() will be heavy if there are lakhs (hundreds of thousands) of elements.

So I need to sort the RDD by score without using collect().
scoreTriplets is an RDD[Array[String]]; each element of the RDD stores an array of the following fields, in this order:
EdgeId sourceID destID score sourceNAme destNAme distance

Please give me any reference or hint.


Answered by zero323

Because of shuffling, sorting will be an expensive operation even without collecting, but you can use the sortBy method:

import scala.util.Random

val data = Seq.fill(10)(Array.fill(3)("") :+ Random.nextInt.toString)
val rdd  = sc.parallelize(data)

val sorted = rdd.sortBy(_.apply(3).toInt)
sorted.take(3)
// Array[Array[String]] = Array(
//   Array("", "", "", -1660860558),
//   Array("", "", "", -1643214719),
//   Array("", "", "", -1206834289))
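
The question asks for a descending sort by score, while the snippet above sorts in ascending order. A minimal sketch of the descending variant follows, using the `ascending` parameter of `sortBy`. The sample rows, the object name `SortByScoreDesc`, and the local `SparkContext` are assumptions made for illustration; in spark-shell the existing `sc` would be used instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByScoreDesc {
  def main(args: Array[String]): Unit = {
    // Local context for illustration only (assumption); spark-shell already provides `sc`.
    val conf = new SparkConf().setAppName("sort-by-score").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical rows laid out as in the question:
    // EdgeId, sourceID, destID, score, sourceNAme, destNAme, distance
    val scoreTriplets = sc.parallelize(Seq(
      Array("e1", "1", "2", "10", "a", "b", "3"),
      Array("e2", "2", "3", "42", "b", "c", "5"),
      Array("e3", "3", "1", "7", "c", "a", "1")
    ))

    // ascending = false puts the highest score first; the result is still an RDD,
    // so nothing is pulled to the driver until an action such as take() runs.
    val sortedDesc = scoreTriplets.sortBy(_(3).toInt, ascending = false)

    // Only the first two rows are brought back to the driver.
    sortedDesc.take(2).foreach(row => println(row.mkString(" ")))

    sc.stop()
  }
}
```

The sort still triggers a shuffle, but the sorted data stays distributed and can be processed further or written out without calling collect().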

If you're interested only in the top results, then top and takeOrdered are usually preferred.

import scala.math.Ordering

rdd.takeOrdered(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] = 
//   Array(Array("", "", "", -1660860558), Array("", "", "", -1643214719))

rdd.top(2)(Ordering.by[Array[String], Int](_.apply(3).toInt))
// Array[Array[String]] = 
//   Array(Array("", "", "", 1920955686), Array("", "", "", 1597012602))
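
Applied to the layout described in the question, the same idea might look like the sketch below. It names the ordering once and shows that `top` with that ordering and `takeOrdered` with the reversed ordering both return the highest-scoring rows first. The sample rows, the object name `TopScores`, and the local `SparkContext` are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopScores {
  def main(args: Array[String]): Unit = {
    // Local context for illustration only (assumption); spark-shell already provides `sc`.
    val sc = new SparkContext(new SparkConf().setAppName("top-scores").setMaster("local[2]"))

    // Hypothetical rows: EdgeId, sourceID, destID, score, sourceNAme, destNAme, distance
    val scoreTriplets = sc.parallelize(Seq(
      Array("e1", "1", "2", "10", "a", "b", "3"),
      Array("e2", "2", "3", "42", "b", "c", "5"),
      Array("e3", "3", "1", "7", "c", "a", "1")
    ))

    // Order rows by the numeric value of the score field (index 3).
    val byScore: Ordering[Array[String]] = Ordering.by[Array[String], Int](_(3).toInt)

    // top(n) returns the n largest rows under the ordering, largest first.
    val best = scoreTriplets.top(2)(byScore)

    // takeOrdered(n) returns the n smallest rows, so reversing the ordering
    // gives the same highest-scoring rows.
    val bestAgain = scoreTriplets.takeOrdered(2)(byScore.reverse)

    best.foreach(row => println(row.mkString(" ")))
    bestAgain.foreach(row => println(row.mkString(" ")))

    sc.stop()
  }
}
```

Both calls return a plain Array on the driver, so they suit a small n; for a fully sorted distributed result, sortBy remains the tool to use.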

Answered by ponkin

There is a sortBy method on RDD (see doc). You can do something like this:

scoreTriplets.sortBy( _(3).toInt )