Python Spark Data Frame Random Splitting
Original URL: http://stackoverflow.com/questions/40293970/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Spark Data Frame Random Splitting
Asked by Baktaawar
I have a Spark data frame which I want to divide into train, validation and test sets in the ratio 0.60, 0.20, 0.20.
I used the following code to do so:
def data_split(x):
    global data_map_var
    d_map = data_map_var.value
    data_row = x.asDict()
    import random
    rand = random.uniform(0.0, 1.0)
    ret_list = ()
    if rand <= 0.6:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'train')
    elif rand <= 0.8:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'test')
    else:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'validation')
    return ret_list
split_sdf = ratings_sdf.map(data_split)
train_sdf = split_sdf.filter(lambda x : x[-1] == 'train').map(lambda x :(x[0],x[1],x[2]))
test_sdf = split_sdf.filter(lambda x : x[-1] == 'test').map(lambda x :(x[0],x[1],x[2]))
validation_sdf = split_sdf.filter(lambda x : x[-1] == 'validation').map(lambda x :(x[0],x[1],x[2]))
print "Total Records in Original Ratings RDD is {}".format(split_sdf.count())
print "Total Records in training data RDD is {}".format(train_sdf.count())
print "Total Records in validation data RDD is {}".format(validation_sdf.count())
print "Total Records in test data RDD is {}".format(test_sdf.count())
#help(ratings_sdf)
Total Records in Original Ratings RDD is 300001
Total Records in training data RDD is 180321
Total Records in validation data RDD is 59763
Total Records in test data RDD is 59837
My original data frame is ratings_sdf, to which I pass a mapper function that does the splitting.

If you check, the counts of train, validation and test do not add up to the split (original ratings) count: 180321 + 59763 + 59837 = 299921, not 300001. And these numbers change on every run of the code.

Where are the remaining records going, and why doesn't the sum match?
Answered by zero323
TL;DR: If you want to split a DataFrame, use the randomSplit method:
ratings_sdf.randomSplit([0.6, 0.2, 0.2])
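For reference, a minimal usage sketch (the variable names are illustrative; the weights are normalized if they don't sum to 1, and the optional seed makes the split reproducible):

# Unpack the three splits; a fixed seed gives a reproducible split.
train_sdf, validation_sdf, test_sdf = ratings_sdf.randomSplit([0.6, 0.2, 0.2], seed=42)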
Your code is just wrong on multiple levels, but there are two fundamental problems that make it broken beyond repair:
1. Spark transformations can be evaluated an arbitrary number of times, and the functions you use should be referentially transparent and side-effect free. Your code evaluates split_sdf multiple times, and data_split uses a stateful RNG, so the results are different on each evaluation. This produces the behavior you describe, where each child sees a different state of the parent RDD. (A sketch of a workaround follows this list.)
2. You don't properly initialize the RNG, and in consequence the random values you get are not independent.
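If you wanted to keep an RDD-based approach despite that, you would have to address both points, for example by seeding the RNG once per partition and caching the tagged RDD so the random tag is, in practice, computed only once before the three filters run. A rough sketch under those assumptions, using the column names from the question and omitting the d_map broadcast lookup for brevity (mapPartitionsWithIndex and cache are standard RDD methods; this is not the answer's own code):

import random

def split_partition(idx, rows):
    # Seed once per partition: deterministic across runs,
    # with distinct streams across partitions.
    rng = random.Random(idx)
    for row in rows:
        r = rng.uniform(0.0, 1.0)
        label = 'train' if r <= 0.6 else ('test' if r <= 0.8 else 'validation')
        yield (row['TRANS'], row['ITEM'], row['Ratings'], label)

# cache() makes re-evaluation of the random tag unlikely (though not
# strictly guaranteed), so the three filtered counts stay consistent.
split_sdf = ratings_sdf.rdd.mapPartitionsWithIndex(split_partition).cache()
train_sdf = split_sdf.filter(lambda x: x[-1] == 'train').map(lambda x: x[:3])

Even then, randomSplit remains the simpler and better-tested option.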