scala - Spark save(write) parquet only one file
Disclaimer: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original link, and attribute it to the original authors (not me): StackOverFlow
Original link: http://stackoverflow.com/questions/51628958/
Spark save(write) parquet only one file
Asked by Easyhyum
If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
then in the temp.parquet folder I get the same number of files as the number of rows.
I think I don't fully understand Parquet, but is this expected behavior?
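A minimal sketch (not part of the original question) of why this happens: Spark writes one part file per partition, so a DataFrame whose partition count equals its row count produces one file per row. The object name and sample data below are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

// Sketch only: shows that the number of output part files matches the
// DataFrame's partition count at write time.
object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-count-check").getOrCreate()
    import spark.implicits._

    val dataFrame = Seq(1, 2, 3, 4).toDF("value")

    // One part file is written per partition; inspect the count before writing.
    println(s"partitions before write: ${dataFrame.rdd.getNumPartitions}")

    dataFrame.write.format("parquet").mode("append").save("temp.parquet")

    spark.stop()
  }
}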
Accepted answer by y2k-shubham
Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
Therefore, as suggested by @Amar, it's better to use repartition.
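A minimal sketch (not from the original answer) contrasting the two options, using the same dataFrame and output path as in the question:

// coalesce(1) avoids a full shuffle, but, as the docs warn, such a drastic
// coalesce can make the upstream computation run on a single node.
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")

// repartition(1) adds a shuffle step, but the stages before the shuffle keep
// their parallelism; only the final write runs in a single task.
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")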
Answered by moped
Although the previous answers are correct, you have to understand the repercussions that come with repartitioning or coalescing to a single partition: all of your data will have to be transferred to a single worker just to immediately write it to a single file.
As is repeatedly mentioned throughout the internet, you should use repartition in this scenario despite the shuffle step that gets added to the execution plan. This step helps to use your cluster's power instead of sequentially merging files.
There is at least one alternative worth mentioning. You can write a simple script that merges all the files into a single one. That way you avoid generating massive network traffic to a single node of your cluster.
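A hedged sketch of what such a merge script could look like, using Hadoop 2.x's FileUtil.copyMerge (this method was removed in Hadoop 3); the paths are placeholders. Note that this kind of byte-level concatenation only yields a valid file for line-oriented formats such as CSV or plain text; Parquet part files carry their own footers, so they would need a format-aware merge tool instead.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Sketch only: concatenate the part files of an output directory into one file
// with the Hadoop FileSystem API. Valid for text/CSV output, not for Parquet.
object MergeOutputFiles {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    val srcDir  = new Path("temp_output")        // directory containing part-* files (placeholder)
    val dstFile = new Path("merged/output.csv")  // single merged file (placeholder)

    // deleteSource = false keeps the original part files;
    // addString = null inserts nothing between concatenated files.
    FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, null)
  }
}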
Answered by Amar
You can set the number of partitions to 1 to save as a single file:
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")

