Spark - Scala: shuffle an RDD / split an RDD into two random parts

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not this site). Original: http://stackoverflow.com/questions/24864828/

scala, apache-spark, rdd

Asked by griffon vulture

How can I take an RDD in Spark and split it randomly into two RDDs, so that each RDD gets a portion of the data (say 97% and 3%)?

I thought I could shuffle the list and then do shuffledList.take((0.97*rddList.count).toInt)

But how can I shuffle the RDD?

Or is there a better way to split the list?

Answered by griffon vulture

I've found a simple and fast way to split the array:

val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.

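For context, here is a minimal, self-contained sketch of that one-liner in use. The SparkContext setup, the names conf, sc, and data, and the seed value are illustrative assumptions, not part of the original answer:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: split an RDD into roughly 97% / 3% parts.
// `conf`, `sc`, and `data` are example names; the optional seed makes
// the split reproducible across runs.
val conf = new SparkConf().setAppName("randomSplitExample").setMaster("local[*]")
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100000)
val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

println(s"f1: ${f1.count()}, f2: ${f2.count()}") // roughly 97000 and 3000

The counts are approximate because randomSplit samples each element independently against the weights rather than cutting the data at an exact boundary.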

Answered by Shyamendra Solanki

You should use the randomSplit method:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

// Randomly splits this RDD with the provided weights.
// weights for splits, will be normalized if they don't sum to 1
// returns split RDDs in an array

Here is its implementation in Spark 1.0:

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  // Normalize the weights and turn them into cumulative boundaries,
  // e.g. Array(0.97, 0.03) becomes Array(0.0, 0.97, 1.0).
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  // Each adjacent pair of boundaries defines a sampling range; every element
  // is assigned to the split whose range contains its uniform random draw.
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
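
Because each element is sampled independently with a Bernoulli draw against these cumulative ranges, the resulting split sizes are approximate rather than exact. As an illustrative sketch of how the weights become sampling ranges (plain Scala, no Spark required; the printed values correspond to the Array(0.97, 0.03) case):

// Illustrative sketch: how randomSplit turns weights into cumulative ranges.
val weights = Array(0.97, 0.03)
val normalizedCumWeights = weights.map(_ / weights.sum).scanLeft(0.0)(_ + _)
// normalizedCumWeights is Array(0.0, 0.97, 1.0)
normalizedCumWeights.sliding(2).foreach { x =>
  println(f"sampling range [${x(0)}%.2f, ${x(1)}%.2f)")
}
// prints: sampling range [0.00, 0.97)
//         sampling range [0.97, 1.00)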