Python: How to re-partition a pyspark dataframe?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45844684/
How to re-partition pyspark dataframe?
Asked by Neo
data.rdd.getNumPartitions() # output 2456
Then I do data.rdd.repartition(3000)
But data.rdd.getNumPartitions()
# output is still 2456
How do I change the number of partitions? One approach could be to first convert the DF into an RDD, repartition it, and then convert the RDD back to a DF, but that takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks
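As the answers below explain, repartition() returns a new DataFrame rather than modifying the existing one in place, which is why the count stays at 2456. A minimal sketch of the working pattern (the local SparkSession, appName, and example range are assumptions for illustration):

from pyspark.sql import SparkSession

# Hypothetical local session, only for illustration; use your own SparkSession in practice.
spark = SparkSession.builder.master("local[*]").appName("repartition-demo").getOrCreate()

data = spark.range(0, 1000)           # small example DataFrame
print(data.rdd.getNumPartitions())    # e.g. 8; depends on the environment

# repartition() does not modify `data` in place; it returns a new DataFrame,
# so the result has to be assigned back (or to a new variable).
data = data.repartition(30)
print(data.rdd.getNumPartitions())    # 30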
Answered by Michel Lemay
You can check the number of partitions:
data.rdd.getNumPartitions()
To change the number of partitions:
newDF = data.repartition(3000)
You can check the number of partitions:
newDF.rdd.getNumPartitions()
Beware of the data shuffle when repartitioning; it is expensive. Take a look at coalesce if needed.
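For context, repartition() always performs a full shuffle, while coalesce() only merges existing partitions and therefore can only reduce their number. A small hedged sketch of the two calls, assuming a DataFrame named data already exists:

# repartition() can increase or decrease the partition count,
# but it always triggers a full shuffle of the data.
more_partitions = data.repartition(3000)
print(more_partitions.rdd.getNumPartitions())   # 3000

# coalesce() only reduces the partition count; it merges existing
# partitions without a full shuffle, so it is much cheaper.
fewer_partitions = data.coalesce(10)
print(fewer_partitions.rdd.getNumPartitions())  # 10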
Answered by Ali Payne
print(df.rdd.getNumPartitions())
# 1
df.repartition(5)                 # returns a new DataFrame; the result is discarded here
print(df.rdd.getNumPartitions())
# 1
df = df.repartition(5)            # assign the result back to keep the new partitioning
print(df.rdd.getNumPartitions())
# 5
see Spark: The Definitive Guide, Chapter 5 - Basic Structured Operations
ISBN-13: 978-1491912218
ISBN-10: 1491912219
Answered by Giorgos Myrianthous
If you want to increase the number of partitions, you can use repartition():
data = data.repartition(3000)
If you want to decrease the number of partitions, I would advise you to use coalesce(), which avoids a full shuffle:
Useful for running operations more efficiently after filtering down a large dataset.
data = data.coalesce(10)
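As a usage sketch of that note (the DataFrame data and its column value are assumptions for illustration): after a selective filter, many partitions may be nearly empty, and coalescing packs the remaining rows into fewer partitions so later stages launch fewer tasks.

from pyspark.sql import functions as F

# A selective filter can leave many small or empty partitions behind.
filtered = data.filter(F.col("value") > 100)

# coalesce() afterwards packs the surviving rows into fewer partitions,
# so subsequent operations run with fewer, better-filled tasks.
filtered = filtered.coalesce(10)
print(filtered.rdd.getNumPartitions())  # at most 10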