scala - Spark save(write) parquet only one file
Disclaimer: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original link, and attribute it to the original authors (not me): StackOverFlow
Original link: http://stackoverflow.com/questions/51628958/
Spark save(write) parquet only one file
Asked by Easyhyum
If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
then in the temp.parquet folder I get the same number of files as the number of rows.
I think I don't fully understand Parquet, but is this expected behavior?
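A minimal sketch (not part of the original question) of why this happens: Spark writes one part file per partition, so a DataFrame whose partition count equals its row count produces one file per row. The object name and sample data below are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

// Sketch only: shows that the number of output part files matches the
// DataFrame's partition count at write time.
object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-count-check").getOrCreate()
    import spark.implicits._

    val dataFrame = Seq(1, 2, 3, 4).toDF("value")

    // One part file is written per partition; inspect the count before writing.
    println(s"partitions before write: ${dataFrame.rdd.getNumPartitions}")

    dataFrame.write.format("parquet").mode("append").save("temp.parquet")

    spark.stop()
  }
}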
Accepted answer by y2k-shubham
Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
Therefore, as suggested by @Amar, it's better to use repartition.
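A minimal sketch (not from the original answer) contrasting the two options, using the same dataFrame and output path as in the question:

// coalesce(1) avoids a full shuffle, but, as the docs warn, such a drastic
// coalesce can make the upstream computation run on a single node.
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")

// repartition(1) adds a shuffle step, but the stages before the shuffle keep
// their parallelism; only the final write runs in a single task.
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")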
Answered by moped
Although the previous answers are correct, you have to understand the repercussions that come with repartitioning or coalescing to a single partition: all of your data will have to be transferred to a single worker just to immediately write it to a single file.
As is repeatedly mentioned throughout the internet, you should use repartition in this scenario despite the shuffle step that gets added to the execution plan. This step helps to use your cluster's power instead of sequentially merging files.
There is at least one alternative worth mentioning. You can write a simple script that merges all the files into a single one. That way you avoid generating massive network traffic to a single node of your cluster.
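A hedged sketch of what such a merge script could look like, using Hadoop 2.x's FileUtil.copyMerge (this method was removed in Hadoop 3); the paths are placeholders. Note that this kind of byte-level concatenation only yields a valid file for line-oriented formats such as CSV or plain text; Parquet part files carry their own footers, so they would need a format-aware merge tool instead.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Sketch only: concatenate the part files of an output directory into one file
// with the Hadoop FileSystem API. Valid for text/CSV output, not for Parquet.
object MergeOutputFiles {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    val srcDir  = new Path("temp_output")        // directory containing part-* files (placeholder)
    val dstFile = new Path("merged/output.csv")  // single merged file (placeholder)

    // deleteSource = false keeps the original part files;
    // addString = null inserts nothing between concatenated files.
    FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, null)
  }
}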
Answered by Amar
You can set the number of partitions to 1 to save as a single file:
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")

