
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38214633/

Date: 2020-10-22 08:26:41  Source: igfitidea

Which is the fastest way to read Json Files from S3 : Spark

Tags: json, scala, apache-spark, amazon-s3, pyspark

Asked by Splee

I have a directory of folders, and each folder contains a compressed JSON file (.gz). Currently I am doing something like:


val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()

E.g.:


testData/May/01/00/File.json.gz

Each compressed file is about 11 to 17 GB.


I have:


  1. Master: 1 c3.4xlarge
  2. Core: 19 c3.4xlarge
  3. Spark 1.5.2
  4. emr-4.2.0

The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read the data (just the two statements above). Is there any faster way to do this? The schema is a little complex as well. I am planning to write some queries to analyze the data set, but I am worried about the time it takes to read the data from S3.


The maximum load can be 10 TB. I am planning to cache the data to process queries later.

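(A rough sketch of that caching plan, assuming df is the DataFrame read above:)

df.cache()   // persist the parsed data in memory across queries
df.count()   // force materialization so later queries hit the cache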

Answered by Splee

If your JSON is uniformly structured, I would advise you to give Spark the schema for your JSON files; this should speed up processing tremendously.


When you don't supply a schema, Spark first reads all of the lines in the file to infer the schema, which, as you have observed, can take a while.


See this documentation for how to create a schema: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

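For example, here is a minimal sketch of a programmatically specified schema. The field names below are hypothetical placeholders standing in for your actual JSON structure:

import org.apache.spark.sql.types._

// Hypothetical fields standing in for the real JSON layout
val mySchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("userId", StringType, nullable = true),
  StructField("payload", StringType, nullable = true)
))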

Then you'd just have to add the schema you created to the jsonFile call:


val df = sqlContext.jsonFile("s3://testData/*/*/*", mySchema)

At this time (I'm using Spark 1.6.2) it seems as if jsonFile has been deprecated, so switching to sqlContext.read.schema(mySchema).json(myJsonRDD) (where myJsonRDD is of type RDD[String]) might be preferable.

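For instance, a minimal sketch on Spark 1.6.x that applies the schema while reading straight from the S3 path, avoiding the intermediate RDD[String] (assuming the mySchema defined above):

// The schema is applied up front, so Spark skips the inference pass
val df = sqlContext.read.schema(mySchema).json("s3://testData/*/*/*")
df.show()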