scala - Specifying the filename when saving a DataFrame as a CSV
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41990086/
Specifying the filename when saving a DataFrame as a CSV
Asked by Spandan Brahmbhatt
Say I have a Spark DF that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
    path: the location/folder name, not the file name.
Spark stores the CSV data at the specified location by creating files named part-*.csv.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code:
df.coalesce(1).write.csv("sample_path")
Current Output:
sample_path
|
+-- part-r-00000.csv
Desired Output:
sample_path
|
+-- my_file.csv
Note: the coalesce function is used to output a single file, and the executor has enough memory to collect the DF without a memory error.
Answered by T. Gawęda
It's not possible to do it directly with Spark's save.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you have part- files. You can easily change the filename after processing, just like in this question.
In Scala it will look like:
import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
// Find the single part file Spark wrote inside the output directory
val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath().getName()
// Move it out under the desired name, then remove the now-empty output directory
fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
fs.delete(new Path("csvDirectory/data.csv"), true)
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
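The write-then-rename steps above can be combined into a small helper. This is only a sketch, not part of the original answer: the helper name `saveAsSingleCsv`, the `spark` session variable, and the temporary-directory naming are assumptions.

```scala
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: write `df` as a single CSV file at `destPath` (e.g. "csvDirectory/my_file.csv").
// Assumes the data fits in one partition, as in the question's note about coalesce(1).
def saveAsSingleCsv(df: DataFrame, destPath: String)(implicit spark: SparkSession): Unit = {
  val tmpDir = destPath + "-tmp"
  // Spark writes a directory containing one part-*.csv file
  df.coalesce(1).write.csv(tmpDir)

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // Locate the part file, move it to the requested filename, then drop the temp directory
  val partFile = fs.globStatus(new Path(tmpDir + "/part*"))(0).getPath
  fs.rename(partFile, new Path(destPath))
  fs.delete(new Path(tmpDir), true)
}
```

Globbing for `part*` instead of hardcoding `part-0000` makes the helper robust to the UUID-suffixed part filenames Spark 2.x produces.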
Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for information about this approach to setting the file name.

