Python - How to re-partition a PySpark dataframe?

Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45844684/

Date: 2020-08-19 17:18:44 · Source: igfitidea

How to re-partition pyspark dataframe?

Tags: python, apache-spark, machine-learning, pyspark

Asked by Neo

data.rdd.getNumPartitions() # output 2456

Then I do
data.rdd.repartition(3000)

but

data.rdd.getNumPartitions()  # output is still 2456


How do I change the number of partitions? One approach is to first convert the DF into an RDD, repartition it, and then convert the RDD back into a DF, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks

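The roundtrip described above would look roughly like the sketch below (a minimal illustration, assuming an active SparkSession named spark and an existing DataFrame named data). It also shows why the original call had no effect: rdd.repartition() returns a new RDD, while the DataFrame itself is immutable and keeps its 2456 partitions.

# Minimal sketch; `spark` (SparkSession) and `data` (DataFrame) are assumed to exist.
repartitioned_rdd = data.rdd.repartition(3000)   # returns a NEW RDD; `data` is unchanged
print(data.rdd.getNumPartitions())               # still 2456

# The DF -> RDD -> DF roundtrip mentioned in the question; it works, but pays for
# RDD (de)serialization on top of the shuffle, which is why it is slow.
data_via_rdd = spark.createDataFrame(repartitioned_rdd, schema=data.schema)
print(data_via_rdd.rdd.getNumPartitions())       # 3000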

Answered by Michel Lemay

You can check the number of partitions:


data.rdd.getNumPartitions()

To change the number of partitions:


newDF = data.repartition(3000)

You can then check the new number of partitions:


newDF.rdd.getNumPartitions()

Beware of the data shuffle when repartitioning; it is expensive. Take a look at coalesce() if needed.

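Putting the answer's steps together, a self-contained sketch in pure PySpark might look like this (the toy DataFrame and the numbers are illustrative, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

data = spark.range(0, 1000000)            # toy DataFrame, just to have something to repartition
print(data.rdd.getNumPartitions())        # default parallelism, e.g. number of cores

newDF = data.repartition(3000)            # full shuffle across the cluster
print(newDF.rdd.getNumPartitions())       # 3000

smallerDF = newDF.coalesce(200)           # merge partitions without a full shuffle
print(smallerDF.rdd.getNumPartitions())   # 200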

Answered by Ali Payne

print(df.rdd.getNumPartitions())
# 1


df.repartition(5)                 # returns a new DataFrame; df itself is not changed
print(df.rdd.getNumPartitions())
# 1


df = df.repartition(5)            # assign the result to keep the repartitioned DataFrame
print(df.rdd.getNumPartitions())
# 5

See Spark: The Definitive Guide, Chapter 5 (Basic Structured Operations).
ISBN-13: 978-1491912218
ISBN-10: 1491912219

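As a side note (not part of the original answer), repartition() also accepts column names, so you can set both the partition count and the partitioning key in one call; the sketch below assumes a DataFrame df that has a column named "key":

df = df.repartition(5, "key")     # 5 partitions, rows hash-partitioned by the "key" column
print(df.rdd.getNumPartitions())  # 5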

Answered by Giorgos Myrianthous

If you want to increase the number of partitions, you can use repartition():


data = data.repartition(3000)


If you want to decrease the number of partitions, I would advise you to use coalesce(), which avoids a full shuffle:


Useful for running operations more efficiently after filtering down a large dataset.


data = data.coalesce(10)
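
A minimal sketch of the "filter a large dataset, then coalesce" pattern quoted above (the DataFrame data, the "status" column, and the output path are placeholders for illustration):

from pyspark.sql import functions as F

filtered = data.filter(F.col("status") == "active")   # much smaller result set after filtering
filtered = filtered.coalesce(10)                       # shrink the partition count without a full shuffle
filtered.write.parquet("/tmp/filtered_output")         # downstream work now runs on 10 partitions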