scala - Specifying the filename when saving a DataFrame as a CSV
Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41990086/
Specifying the filename when saving a DataFrame as a CSV
Asked by Spandan Brahmbhatt
Say I have a Spark DF that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file.
The function is defined as
def csv(path: String): Unit
    path: the location/folder name, not the file name.
Spark stores the CSV data at the specified location by creating files named part-*.csv.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code:
df.coalesce(1).write.csv("sample_path")
Current Output:
sample_path
|
+-- part-r-00000.csv
Desired Output:
sample_path
|
+-- my_file.csv
Note: the coalesce function is used to output a single file, and the executor has enough memory to collect the DF without a memory error.
Answered by T. Gawęda
It's not possible to do it directly with Spark's save.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you have part- files. You can easily change the filename after processing, just like in this question.
In Scala it will look like:
import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
// Find the single part file Spark wrote inside the output directory
val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath().getName()
// Move it out under the desired name, then remove the now-empty output directory
fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
fs.delete(new Path("csvDirectory/data.csv"), true)
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
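The write-then-rename steps above can be combined into a small helper. This is only a sketch, not part of the original answer: the helper name `saveAsSingleCsv`, the `spark` session variable, and the temporary-directory naming are assumptions.

```scala
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: write `df` as a single CSV file at `destPath` (e.g. "csvDirectory/my_file.csv").
// Assumes the data fits in one partition, as in the question's note about coalesce(1).
def saveAsSingleCsv(df: DataFrame, destPath: String)(implicit spark: SparkSession): Unit = {
  val tmpDir = destPath + "-tmp"
  // Spark writes a directory containing one part-*.csv file
  df.coalesce(1).write.csv(tmpDir)

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // Locate the part file, move it to the requested filename, then drop the temp directory
  val partFile = fs.globStatus(new Path(tmpDir + "/part*"))(0).getPath
  fs.rename(partFile, new Path(destPath))
  fs.delete(new Path(tmpDir), true)
}
```

Globbing for `part*` instead of hardcoding `part-0000` makes the helper robust to the UUID-suffixed part filenames Spark 2.x produces.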
Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for information about this approach to setting the file name.

