Original URL: http://stackoverflow.com/questions/35445486/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Spark saveAsTextFile() writes to multiple files instead of one
Asked by stackoverflowuser2010
I am using Spark and Scala on my laptop at this moment.
When I write an RDD to a file, the output is written to two files "part-00000" and "part-00001". How can I force Spark / Scala to write to one file?
My code is currently:
myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")
where I am removing the parentheses to write out key,value pairs.
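As background: the number of part-XXXXX files that saveAsTextFile writes equals the number of partitions of the RDD. A minimal sketch to confirm this, assuming a local Spark setup and illustrative data (not the asker's actual data):

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("partition-check")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical pair RDD standing in for the question's myRDD,
    // explicitly split into two partitions.
    val myRDD = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 2)

    // saveAsTextFile would emit one part file per partition:
    println(myRDD.getNumPartitions) // 2 -> part-00000 and part-00001

    spark.stop()
  }
}
```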
Answered by Alberto Bonsanto
The "problem" is actually a feature: it is produced by how your RDD is partitioned, so the output is split into n parts, where n is the number of partitions. To fix this you just need to change the number of partitions to one, using repartition on your RDD. The documentation states:
repartition(numPartitions)
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
For example, this change should work.
myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
As the documentation says, you can also use coalesce, which is actually the recommended option when you are reducing the number of partitions. Note, however, that reducing the number of partitions to one is generally a bad idea, because it moves all the data to a single node and loses parallelism.
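A sketch of the coalesce variant, assuming a local SparkSession and illustrative data in place of the asker's RDD (the output path is the question's placeholder):

```scala
import org.apache.spark.sql.SparkSession

object SingleFileOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-file-output")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical pair RDD standing in for the question's myRDD.
    val myRDD = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

    // coalesce(1) merges all partitions into one without a full shuffle,
    // so saveAsTextFile produces a single part-00000 file.
    myRDD.map(x => x._1 + "," + x._2)
      .coalesce(1)
      .saveAsTextFile("/path/to/output")

    spark.stop()
  }
}
```

Unlike repartition(1), coalesce(1) avoids a shuffle, but either way the final write happens on a single task.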

