scala - How to save a DataFrame as compressed (gzipped) CSV?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/40163996/

Tags: scala, csv, apache-spark, spark-dataframe

Asked by user2628641

I use Spark 1.6.0 and Scala.

I want to save a DataFrame in compressed CSV format.

Here is what I have so far (assume I already have df and sc as SparkContext):

//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)

The output is not in gz format.

Accepted answer by Alex-Antoine Fortin

On the spark-csv GitHub page, https://github.com/databricks/spark-csv, one can read:

codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

In your case, this should work:

df.write
  .format("com.databricks.spark.csv")
  .codec("gzip")
  .save("my_directory/my_file.gzip")
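
Note that a plain DataFrameWriter has no .codec method (the next answer points this out for Spark 2.1), so the same setting is more portably passed as an option. A minimal sketch, assuming the spark-csv package is on the classpath and using the shorthand codec name from the docs quoted above:

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip") // shorthand name, per the spark-csv docs
  .save("my_directory/my_file.gzip")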

Answer by Ravi Kant Saini

This code works for Spark 2.1, where .codec is not available.

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2, you can use the df.write.csv(..., codec="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
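
Since that link documents the Python API and this question is about Scala, here is a rough Scala equivalent using the built-in CSV writer (a sketch for Spark 2.x, reusing the df and my_directory from the question):

// Spark 2.x ships a built-in CSV source, so no external package is needed;
// the "compression" option accepts shorthand codec names such as "gzip"
df.write
  .option("compression", "gzip")
  .csv(my_directory)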

Answer by Nick Chammas

With Spark 2.0+, this has become a bit simpler:

df.write.csv("path", compression="gzip")

You don't need the external Databricks CSV package anymore.

The csv() writer supports a number of handy options, for example (a combined sketch follows this list):

  • sep: To set the separator character.
  • quote: Whether and how to quote values.
  • header: Whether to include a header line.
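
A minimal sketch combining these options with gzip compression (the separator, quote character, and output path are illustrative choices, not anything mandated by the answer):

df.write
  .option("sep", "|")            // custom field separator
  .option("quote", "\"")         // quote character (this is the default)
  .option("header", "true")      // include a header line
  .option("compression", "gzip") // gzip the output part files
  .csv("my_directory")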

There are also a number of other compression codecs you can use, in addition to gzip (each can be substituted directly for gzip; see the one-liner after this list):

  • bzip2
  • lz4
  • snappy
  • deflate
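
For example, switching to bzip2 (a sketch; the path is a placeholder):

df.write.option("compression", "bzip2").csv("my_directory")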

The full Spark docs for the csv() writer are here: Python / Scala

Answer by morfious902002

To write the CSV file with a header and rename the part-000 file to .csv.gzip:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

DF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)

copyRename(tempLocationFileName, finalLocationFileName)

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // merge the part files under srcPath into the single file at dstPath;
  // the "true" argument deletes the source files once they are merged
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}

If you don't need the header, set it to false, and then you won't need the coalesce either. It will be faster to write too.
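
A sketch of that faster, header-less variant (outputPath is a placeholder; each partition writes its own gzipped part file, so no coalesce or merge step is needed):

DF.write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "false")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(outputPath)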