Original URL: http://stackoverflow.com/questions/35445486/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Spark saveAsTextFile() writes to multiple files instead of one
Asked by stackoverflowuser2010
I am using Spark and Scala on my laptop at this moment.
When I write an RDD to a file, the output is written to two files "part-00000" and "part-00001". How can I force Spark / Scala to write to one file?
My code is currently:
myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")
where I am removing the parentheses to write out key,value pairs.
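As background: the number of part-XXXXX files that saveAsTextFile writes equals the number of partitions of the RDD. A minimal sketch to confirm this, assuming a local Spark setup and illustrative data (not the asker's actual data):

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("partition-check")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical pair RDD standing in for the question's myRDD,
    // explicitly split into two partitions.
    val myRDD = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 2)

    // saveAsTextFile would emit one part file per partition:
    println(myRDD.getNumPartitions) // 2 -> part-00000 and part-00001

    spark.stop()
  }
}
```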
Answered by Alberto Bonsanto
The "problem" is actually a feature: it is produced by how your RDD is partitioned, so the output is split into n parts, where n is the number of partitions. To fix this you just need to change the number of partitions to one, using repartition on your RDD. The documentation states:
repartition(numPartitions)
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
For example, this change should work.
myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
As the documentation says, you can also use coalesce, which is actually the recommended option when you are reducing the number of partitions. Note, however, that reducing the number of partitions to one is generally a bad idea, because it moves all the data to a single node and loses parallelism.
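A sketch of the coalesce variant, assuming a local SparkSession and illustrative data in place of the asker's RDD (the output path is the question's placeholder):

```scala
import org.apache.spark.sql.SparkSession

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-file-output")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical pair RDD standing in for the question's myRDD.
    val myRDD = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

    // coalesce(1) merges all partitions into one without a full shuffle,
    // so saveAsTextFile produces a single part-00000 file.
    myRDD.map(x => x._1 + "," + x._2)
      .coalesce(1)
      .saveAsTextFile("/path/to/output")

    spark.stop()
  }
}
```

Unlike repartition(1), coalesce(1) avoids a shuffle, but either way the final write happens on a single task.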

