Spark - scala: shuffle RDD / split RDD into two random parts randomly
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, keep the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24864828/
Asked by griffon vulture
How can I take an RDD in Spark and split it randomly into two RDDs, so that each RDD gets some part of the data (let's say 97% and 3%)?
I thought of shuffling the list and then taking the first part with shuffledList.take((0.97 * rddList.count).toInt) (a sketch of this idea follows the question).
But how can I shuffle the RDD?
Or is there a better way to split the list?
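As a side note, one way to express the asker's shuffle idea (a sketch, not from the original post; the data name is an assumption) is to pair each element with a random key and sort:

import org.apache.spark.SparkContext._ // brings sortByKey onto pair RDDs (needed explicitly in older Spark versions)
import scala.util.Random

// Attach a random key to each element, sort by it, then drop the key:
// the result is the original RDD in a pseudo-random order.
val shuffled = data.map(x => (Random.nextDouble(), x)).sortByKey().map(_._2)

Note, however, that RDD.take(n) returns a local Array on the driver rather than another RDD, so shuffle-and-take does not actually produce two RDDs; the randomSplit approach in the answers below is a better fit.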
Answered by griffon vulture
I've found a simple and fast way to split the array:
val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))
It will split the data using the provided weights.
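A minimal end-to-end sketch of this answer (the SparkContext setup and sample data are illustrative assumptions, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("randomSplitDemo").setMaster("local[*]"))
val data = sc.parallelize(1 to 100000)

// An optional seed makes the split reproducible across runs.
val Array(train, test) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

println(s"train: ${train.count()}, test: ${test.count()}") // roughly 97% / 3%

Because each element is sampled independently, the resulting sizes are only approximately 97%/3%, not an exact cut.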
Answered by Shyamendra Solanki
You should use the randomSplit method:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
// Randomly splits this RDD with the provided weights.
// weights: the weights for the splits; normalized if they don't sum to 1
// returns: the split RDDs in an array
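Since the weights are normalized, unscaled weights request the same proportions; for example (illustrative, reusing the data RDD from the sketch above):

// Equivalent to Array(0.97, 0.03): the weights are normalized internally.
val Array(big, small) = data.randomSplit(Array(97.0, 3.0))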
Here is its implementation in Spark 1.0:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
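To see what the weight handling does, here is the cumulative-boundary computation in plain Scala (illustration only, runnable outside Spark):

val weights = Array(0.97, 0.03)
// Normalize and build cumulative boundaries: Array(0.0, 0.97, 1.0)
val normalizedCumWeights = weights.map(_ / weights.sum).scanLeft(0.0)(_ + _)
// Adjacent boundary pairs form the sampling ranges: [0.0, 0.97) and [0.97, 1.0)
val ranges = normalizedCumWeights.sliding(2).toList

Each BernoulliSampler keeps an element when that element's uniform random draw falls inside the sampler's [lower, upper) range. Since all splits are built with the same seed, they see the same draws, so every element lands in exactly one split and the splits are disjoint.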

