
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38214633/

Date: 2020-10-22 08:26:41  Source: igfitidea

Which is the fastest way to read Json Files from S3 : Spark

Tags: json, scala, apache-spark, amazon-s3, pyspark

Asked by Splee

I have a directory of folders, and each folder contains a compressed JSON file (.gz). Currently I am doing something like:


val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()

E.g.:


testData/May/01/00/File.json.gz

Each compressed file is about 11 to 17 GB.


I have:


  1. Master: 1 c3.4xlarge
  2. Core: 19 c3.4xlarge
  3. Spark 1.5.2
  4. emr-4.2.0

The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read the data (just the two statements above). Is there any faster way to do this? The schema is a little complex as well. I am planning to write some queries to analyze the data set, but I am worried about the time it takes to read the data from S3.


The maximum load can be 10 TB. I am planning to cache the data to process queries later.

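(A rough sketch of that caching plan, assuming df is the DataFrame read above:)

df.cache()   // persist the parsed data in memory across queries
df.count()   // force materialization so later queries hit the cache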

Answered by Splee

If your JSON is uniformly structured, I would advise you to give Spark the schema for your JSON files; this should speed up processing tremendously.


When you don't supply a schema, Spark first reads all of the lines in the file to infer the schema, which, as you have observed, can take a while.


See this documentation for how to create a schema: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

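For example, here is a minimal sketch of a programmatically specified schema. The field names below are hypothetical placeholders standing in for your actual JSON structure:

import org.apache.spark.sql.types._

// Hypothetical fields standing in for the real JSON layout
val mySchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("userId", StringType, nullable = true),
  StructField("payload", StringType, nullable = true)
))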

Then you'd just have to add the schema you created to the jsonFile call:


val df = sqlContext.jsonFile("s3://testData/*/*/*", mySchema)

At this time (I'm using Spark 1.6.2) it seems as if jsonFile has been deprecated, so switching to sqlContext.read.schema(mySchema).json(myJsonRDD) (where myJsonRDD is of type RDD[String]) might be preferable.

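For instance, a minimal sketch on Spark 1.6.x that applies the schema while reading straight from the S3 path, avoiding the intermediate RDD[String] (assuming the mySchema defined above):

// The schema is applied up front, so Spark skips the inference pass
val df = sqlContext.read.schema(mySchema).json("s3://testData/*/*/*")
df.show()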