scala - How to save a DataFrame as compressed (gzipped) CSV?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/40163996/

Tags: scala, csv, apache-spark, spark-dataframe

Asked by user2628641

I use Spark 1.6.0 and Scala.

I want to save a DataFrame in compressed CSV format.

Here is what I have so far (assume I already have df and sc as SparkContext):

//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)

The output is not in gz format.

Accepted answer by Alex-Antoine Fortin

On the spark-csv GitHub page, https://github.com/databricks/spark-csv, one can read:

codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

In your case, this should work:

df.write
  .format("com.databricks.spark.csv")
  .codec("gzip")
  .save("my_directory/my_file.gzip")
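
Note that a plain DataFrameWriter has no .codec method (the next answer points this out for Spark 2.1), so the same setting is more portably passed as an option. A minimal sketch, assuming the spark-csv package is on the classpath and using the shorthand codec name from the docs quoted above:

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip") // shorthand name, per the spark-csv docs
  .save("my_directory/my_file.gzip")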

Answer by Ravi Kant Saini

This code works for Spark 2.1, where .codec is not available.

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2, you can use the df.write.csv(..., codec="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
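
Since that link documents the Python API and this question is about Scala, here is a rough Scala equivalent using the built-in CSV writer (a sketch for Spark 2.x, reusing the df and my_directory from the question):

// Spark 2.x ships a built-in CSV source, so no external package is needed;
// the "compression" option accepts shorthand codec names such as "gzip"
df.write
  .option("compression", "gzip")
  .csv(my_directory)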

Answer by Nick Chammas

With Spark 2.0+, this has become a bit simpler:

df.write.csv("path", compression="gzip")

You don't need the external Databricks CSV package anymore.

The csv() writer supports a number of handy options, for example (a combined sketch follows this list):

  • sep: To set the separator character.
  • quote: Whether and how to quote values.
  • header: Whether to include a header line.
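
A minimal sketch combining these options with gzip compression (the separator, quote character, and output path are illustrative choices, not anything mandated by the answer):

df.write
  .option("sep", "|")            // custom field separator
  .option("quote", "\"")         // quote character (this is the default)
  .option("header", "true")      // include a header line
  .option("compression", "gzip") // gzip the output part files
  .csv("my_directory")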

There are also a number of other compression codecs you can use, in addition to gzip (each can be substituted directly for gzip; see the one-liner after this list):

  • bzip2
  • lz4
  • snappy
  • deflate
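
For example, switching to bzip2 (a sketch; the path is a placeholder):

df.write.option("compression", "bzip2").csv("my_directory")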

The full Spark docs for the csv() writer are here: Python / Scala

Answer by morfious902002

To write the CSV file with a header and rename the part-000 file to .csv.gzip:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

DF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)

copyRename(tempLocationFileName, finalLocationFileName)

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // merge the part files under srcPath into the single file at dstPath;
  // the "true" argument deletes the source files once they are merged
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}

If you don't need the header, set it to false, and then you won't need the coalesce either. It will be faster to write too.
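
A sketch of that faster, header-less variant (outputPath is a placeholder; each partition writes its own gzipped part file, so no coalesce or merge step is needed):

DF.write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "false")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(outputPath)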