Is gzip format supported in Java Spark?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/16302385/

Is gzip format supported in Spark?

java, scala, mapreduce, gzip, apache-spark

Asked by ptikobj

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS.

However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of unzipped files.

Is there a way to manually implement reading of gzipped files or is unzipping already automatically done when reading a .gz file?

Accepted answer by Josh Rosen

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

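To make the quoted capability concrete, here is a minimal sketch of creating RDDs from several Hadoop-supported sources. The paths, host name, bucket name, and SequenceFile record types below are assumptions for illustration, not part of the original answer.

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopSources {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hadoop-sources").setMaster("local[*]"))

        // Plain text from the local file system, HDFS, and Amazon S3
        // (S3 requires the Hadoop S3 connector on the classpath).
        val localLines = sc.textFile("file:///tmp/input.txt")
        val hdfsLines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
        val s3Lines    = sc.textFile("s3a://my-bucket/data/input.txt")

        // A SequenceFile of (Text, IntWritable) records.
        val pairs = sc.sequenceFile("hdfs://namenode:8020/data/pairs.seq",
                                    classOf[Text], classOf[IntWritable])

        println(localLines.count() + hdfsLines.count() + s3Lines.count() + pairs.count())
        sc.stop()
      }
    }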

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).

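The following is a minimal sketch of that behaviour; the file name is the one used in the answer, while the app name and local master are assumptions for running it standalone.

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-read").setMaster("local[*]"))

        // No special API is needed for .gz input: textFile() picks the codec
        // from the file extension and decompresses transparently while reading.
        val lines = sc.textFile("myFile.gz")

        println(s"lines: ${lines.count()}")
        println(s"partitions: ${lines.partitions.length}") // 1 for a single gzip file

        sc.stop()
      }
    }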

As mentioned by @nick-chammas in the comments:

Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.

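A minimal sketch of working around that limitation by repartitioning after the read; the file name, the partition count of 8, and the word-count job are assumptions for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object GzipRepartition {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-repartition").setMaster("local[*]"))

        val lines  = sc.textFile("myFile.gz") // 1 partition: gzip is not splittable
        val spread = lines.repartition(8)     // shuffle the data into 8 partitions

        // Operations on `spread` can now run on up to 8 cores in parallel.
        val wordCounts = spread
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        wordCounts.take(10).foreach(println)
        sc.stop()
      }
    }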