How to export a DataFrame to CSV in Scala?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/32527519/

Time: 2020-10-22 07:35:46  Source: igfitidea

Tags: scala, csv, apache-spark

Asked by Tong

How can I export a Spark DataFrame to a CSV file using Scala?

Answered by karthik manchala

The easiest way to do this is to use the spark-csv library. You can check the documentation in the provided link, and here is the Scala example of how to load and save data from/to a DataFrame.

Code (Spark 1.4+):

dataFrame.write.format("com.databricks.spark.csv").save("myFile.csv")

Edit:

Spark creates part-files while saving the CSV data. If you want to merge the part-files into a single CSV, refer to the following:

Merge Spark's CSV output folder to Single File
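
As a rough illustration of what merging the part-files can look like, here is a minimal sketch that concatenates them with plain java.nio. It is not the approach from the linked page: the object name MergePartFiles and its merge method are made up for this example, and it assumes the output folder is on the local filesystem (not HDFS) and that the part-files were written without a header row, since otherwise the header would be repeated once per part.

```scala
import java.io.File
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark or spark-csv.
object MergePartFiles {
  /** Concatenates all part-* files in outputDir (sorted by name) into target. */
  def merge(outputDir: Path, target: Path): Unit = {
    // listFiles() returns null if outputDir is not a directory, hence the Option.
    val parts = Option(outputDir.toFile.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-")) // skips _SUCCESS and .crc files
      .sortBy(_.getName)

    // Creates the target file, or truncates it if it already exists.
    val out = Files.newOutputStream(target)
    try parts.foreach(p => Files.copy(p.toPath, out))
    finally out.close()
  }
}
```

For example, MergePartFiles.merge(Paths.get("myFile.csv"), Paths.get("merged.csv")) would read the folder written by the save call above, assuming both paths are local.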

Answered by Taylrl

In Spark versions 2+ you can simply use the following:

df.write.csv("/your/location/data.csv")

If you want to make sure that the output is no longer split across multiple part-files, add a .coalesce(1) as follows:

df.coalesce(1).write.csv("/your/location/data.csv")
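
To put the call above in context, here is a small self-contained sketch for Spark 2+ that also writes a header row and overwrites any previous output. The object name CsvExportSketch, the helper writeSingleCsv, and the sample data are assumptions made for illustration; the path is the placeholder from the answer. Note that the path is still created as a directory holding one part-file, not as a single file with that name.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative names only; nothing here comes from the original answer except the write call.
object CsvExportSketch {
  /** Writes df as CSV into outputDir as a single part-file with a header row. */
  def writeSingleCsv(df: DataFrame, outputDir: String): Unit =
    df.coalesce(1)               // one partition, so one part-file
      .write
      .mode("overwrite")         // replace earlier output instead of failing
      .option("header", "true")  // first line holds the column names
      .csv(outputDir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-export-sketch")
      .master("local[*]")        // assumption: local run, for demonstration only
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    writeSingleCsv(df, "/your/location/data.csv")
    spark.stop()
  }
}
```

Because coalesce(1) funnels all rows through one task, this is probably only sensible for results that comfortably fit on a single executor.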

Answered by Abu Shoeb

The above solution exports the CSV as multiple partitions. I found another solution by zero323 on this StackOverflow page that exports a DataFrame into one single CSV file when you use coalesce.

df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/your/location/mydata")

This would create a directory named mydata where you'll find a csv file that contains the results.
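
If a file with a predictable name is needed, one option is to copy the single part-file out of that directory afterwards. The sketch below shows the idea with java.nio; the object SinglePartCsv and its extract method are hypothetical names, and it assumes the directory is on the local filesystem and contains exactly one part-file (as it should after coalesce(1)).

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not part of Spark or spark-csv.
object SinglePartCsv {
  /** Copies the only part-* file found in outputDir to target and returns target. */
  def extract(outputDir: Path, target: Path): Path = {
    val parts = Option(outputDir.toFile.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-")) // ignores _SUCCESS and .crc files

    // Fail loudly rather than silently picking one of several parts.
    require(parts.length == 1,
      s"expected exactly one part file in $outputDir, found ${parts.length}")

    Files.copy(parts.head.toPath, target, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

For example, SinglePartCsv.extract(Paths.get("/your/location/mydata"), Paths.get("/your/location/mydata.csv")) would leave the original directory untouched and produce a plain CSV file next to it.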