scala - Reading in multiple files compressed in tar.gz archive into Spark

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38635905/

Reading in multiple files compressed in tar.gz archive into Spark

scala, apache-spark, gzip, rdd

Asked by septra

I'm trying to create a Spark RDD from several json files compressed into a tar. For example, I have 3 files

file1.json
file2.json
file3.json

And these are contained in archive.tar.gz.

I want to create a dataframe from the json files. The problem is that Spark is not reading in the json files correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.

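For context, the reads that produce the garbled output look roughly like this (a sketch; the exact symptoms vary by Spark version, e.g. a _corrupt_record column from the json reader, or raw tar header bytes mixed into the text lines):

// Spark sees archive.tar.gz as one gzip-compressed text file, so the tar
// headers and file boundaries leak into the records instead of being stripped.
val jsonAttempt = sqlContext.read.json("archive.tar.gz")
val textAttempt = sc.textFile("archive.tar.gz")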

Is there some way to handle gzipped archives containing multiple files in Spark?

UPDATE

Using the method given in the answer to "Read whole text files from a compression in Spark" I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering whether there is an efficient way to deal with the problem.

I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.

Accepted answer by septra

A solution is given in "Read whole text files from a compression in Spark". Using the code sample provided, I was able to create a dataframe from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*").
               flatMapValues(x => extractFiles(x).toOption).
               mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
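
The extractFiles and decode helpers referenced above come from the linked answer; for completeness, here is a sketch of what they can look like, assuming Apache Commons Compress is on the classpath (the names match the calls above, but the buffer size and charset handling are illustrative):

import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream

// Unpacks every regular file inside a tar.gz stream into an in-memory byte array.
def extractFiles(ps: PortableDataStream, bufferSize: Int = 1024): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map { _ =>
      // Read the current entry fully; read() stops at the entry boundary.
      val out = new ByteArrayOutputStream
      val buffer = new Array[Byte](bufferSize)
      Iterator.continually(tar.read(buffer))
        .takeWhile(_ != -1)
        .foreach(n => out.write(buffer, 0, n))
      out.toByteArray
    }
    .toVector
}

// Decodes the raw bytes of one extracted file into a (JSON) string.
def decode(charset: java.nio.charset.Charset = UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)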

This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (unlike tar archives).

See: stuartsierra.com/2008/04/24/a-million-little-files

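A sketch of that one-off conversion, reusing the extractFiles and decode helpers from above (the paths and the choice of key are illustrative):

// One-off job: unpack each archive once and persist the JSON documents
// into a SequenceFile, keyed by the archive they came from.
sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .flatMapValues(identity)
  .mapValues(decode())
  .saveAsSequenceFile("json-as-sequencefile")

// Later jobs read the SequenceFile in parallel, one task per split,
// instead of decompressing a whole tar.gz on a single executor.
val dfFromSeq = sqlContext.read.json(
  sc.sequenceFile[String, String]("json-as-sequencefile").map(_._2))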

Answer by DJHenjin

Files inside of a *.tar.gz file, as you have already mentioned, are compressed. You cannot put the 3 files into a single compressed tar file and expect the import function (which is looking only for text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.

I would recommend you take the time to upload each individual json file manually, since neither sc.textFile nor sqlContext.read.json knows how to unpack a tar archive.

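If you do store the json files individually (uncompressed, or each gzipped on its own), reading them needs no custom code; a sketch, assuming they all live under one directory:

// Plain json files (or individually gzipped ones) can be read and split
// across tasks directly, without any tar handling.
val dfPlain = sqlContext.read.json("jsondir/*.json")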