Python - How to re-partition a PySpark dataframe?

Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45844684/

Date: 2020-08-19 17:18:44 · Source: igfitidea

How to re-partition pyspark dataframe?

Tags: python, apache-spark, machine-learning, pyspark

Asked by Neo

data.rdd.getNumPartitions() # output 2456

Then I do
data.rdd.repartition(3000)

but

data.rdd.getNumPartitions()  # output is still 2456


How do I change the number of partitions? One approach is to first convert the DF into an RDD, repartition it, and then convert the RDD back into a DF, but this takes a lot of time. Also, does increasing the number of partitions make operations more distributed and therefore faster? Thanks

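The roundtrip described above would look roughly like the sketch below (a minimal illustration, assuming an active SparkSession named spark and an existing DataFrame named data). It also shows why the original call had no effect: rdd.repartition() returns a new RDD, while the DataFrame itself is immutable and keeps its 2456 partitions.

# Minimal sketch; `spark` (SparkSession) and `data` (DataFrame) are assumed to exist.
repartitioned_rdd = data.rdd.repartition(3000)   # returns a NEW RDD; `data` is unchanged
print(data.rdd.getNumPartitions())               # still 2456

# The DF -> RDD -> DF roundtrip mentioned in the question; it works, but pays for
# RDD (de)serialization on top of the shuffle, which is why it is slow.
data_via_rdd = spark.createDataFrame(repartitioned_rdd, schema=data.schema)
print(data_via_rdd.rdd.getNumPartitions())       # 3000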

Answered by Michel Lemay

You can check the number of partitions:


data.rdd.getNumPartitions()

To change the number of partitions:


newDF = data.repartition(3000)

You can then check the new number of partitions:


newDF.rdd.getNumPartitions()

Beware of the data shuffle when repartitioning; it is expensive. Take a look at coalesce() if needed.

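Putting the answer's steps together, a self-contained sketch in pure PySpark might look like this (the toy DataFrame and the numbers are illustrative, not from the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

data = spark.range(0, 1000000)            # toy DataFrame, just to have something to repartition
print(data.rdd.getNumPartitions())        # default parallelism, e.g. number of cores

newDF = data.repartition(3000)            # full shuffle across the cluster
print(newDF.rdd.getNumPartitions())       # 3000

smallerDF = newDF.coalesce(200)           # merge partitions without a full shuffle
print(smallerDF.rdd.getNumPartitions())   # 200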

Answered by Ali Payne

print(df.rdd.getNumPartitions())
# 1


df.repartition(5)                 # returns a new DataFrame; df itself is not changed
print(df.rdd.getNumPartitions())
# 1


df = df.repartition(5)            # assign the result to keep the repartitioned DataFrame
print(df.rdd.getNumPartitions())
# 5

See Spark: The Definitive Guide, Chapter 5 (Basic Structured Operations).
ISBN-13: 978-1491912218
ISBN-10: 1491912219

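As a side note (not part of the original answer), repartition() also accepts column names, so you can set both the partition count and the partitioning key in one call; the sketch below assumes a DataFrame df that has a column named "key":

df = df.repartition(5, "key")     # 5 partitions, rows hash-partitioned by the "key" column
print(df.rdd.getNumPartitions())  # 5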

Answered by Giorgos Myrianthous

If you want to increase the number of partitions, you can use repartition():


data = data.repartition(3000)


If you want to decrease the number of partitions, I would advise you to use coalesce(), which avoids a full shuffle:


Useful for running operations more efficiently after filtering down a large dataset.


data = data.coalesce(10)
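
A minimal sketch of the "filter a large dataset, then coalesce" pattern quoted above (the DataFrame data, the "status" column, and the output path are placeholders for illustration):

from pyspark.sql import functions as F

filtered = data.filter(F.col("status") == "active")   # much smaller result set after filtering
filtered = filtered.coalesce(10)                       # shrink the partition count without a full shuffle
filtered.write.parquet("/tmp/filtered_output")         # downstream work now runs on 10 partitions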