Is gzip format supported in Java Spark?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/16302385/

Is gzip format supported in Spark?

java, scala, mapreduce, gzip, apache-spark

Asked by ptikobj

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS.

However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files.

Is there a way to manually implement reading of gzipped files or is unzipping already automatically done when reading a .gz file?

Accepted answer by Josh Rosen

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

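To make the quoted capability concrete, here is a minimal sketch of creating RDDs from several Hadoop-supported sources. The paths, host name, bucket name, and SequenceFile record types below are assumptions for illustration, not part of the original answer.

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopSources {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hadoop-sources").setMaster("local[*]"))

        // Plain text from the local file system, HDFS, and Amazon S3
        // (S3 requires the Hadoop S3 connector on the classpath).
        val localLines = sc.textFile("file:///tmp/input.txt")
        val hdfsLines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
        val s3Lines    = sc.textFile("s3a://my-bucket/data/input.txt")

        // A SequenceFile of (Text, IntWritable) records.
        val pairs = sc.sequenceFile("hdfs://namenode:8020/data/pairs.seq",
                                    classOf[Text], classOf[IntWritable])

        println(localLines.count() + hdfsLines.count() + s3Lines.count() + pairs.count())
        sc.stop()
      }
    }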

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).

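The following is a minimal sketch of that behaviour; the file name is the one used in the answer, while the app name and local master are assumptions for running it standalone.

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-read").setMaster("local[*]"))

        // No special API is needed for .gz input: textFile() picks the codec
        // from the file extension and decompresses transparently while reading.
        val lines = sc.textFile("myFile.gz")

        println(s"lines: ${lines.count()}")
        println(s"partitions: ${lines.partitions.length}") // 1 for a single gzip file

        sc.stop()
      }
    }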

As mentioned by @nick-chammas in the comments:

Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.

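A minimal sketch of working around that limitation by repartitioning after the read; the file name, the partition count of 8, and the word-count job are assumptions for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipRepartition {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-repartition").setMaster("local[*]"))

        val lines  = sc.textFile("myFile.gz") // 1 partition: gzip is not splittable
        val spread = lines.repartition(8)     // shuffle the data into 8 partitions

        // Operations on `spread` can now run on up to 8 cores in parallel.
        val wordCounts = spread
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        wordCounts.take(10).foreach(println)
        sc.stop()
      }
    }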