Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/33710898/

Time: 2020-11-02 22:01:56  Source: igfitidea

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

java json apache-spark

Asked by Abu Sulaiman

I can use the following code to read a single json file but I need to read multiple json files and merge them into one Dataframe. How can I do this?

DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");

Or is there a way to read multiple json files into JavaRDD then convert to Dataframe?

Answered by zero323

You can use exactly the same code to read multiple JSON files. Just pass a path to a directory, or a path with wildcards, instead of the path to a single file.

DataFrameReader also provides a json method with the following signature:

json(jsonRDD: JavaRDD[String])

which can be used to parse JSON that is already loaded into a JavaRDD.

Answered by tjriggs

To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.

context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")

Answered by asheesh kumar singhal

The function spark.read.json accepts a list of files as a parameter.

spark.read.json(list_of_all_json_files)

This will read all the files in the list and return a single data frame for all the information in the files.

Answered by dmigo

The function json(String... paths) takes variable arguments. (documentation)

So you can change your code like this:

sqlContext.read().json(file1, file2, ...)
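
When the file names are not known in advance, the arguments for json(String... paths) can be built at runtime, since Java accepts a String[] wherever varargs are expected. The following is only a minimal sketch using the JDK alone: the class JsonPathCollector and the method jsonFilesIn are hypothetical names introduced for illustration, not part of Spark. The Spark call itself appears only in a comment because it needs Spark on the classpath, and it assumes the same sqlContext as in the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the Spark API.
public class JsonPathCollector {

    // Returns the sorted paths of all regular *.json files directly inside dir.
    public static String[] jsonFilesIn(String dir) {
        List<String> paths = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(Paths.get(dir), "*.json")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    paths.add(p.toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(paths);
        // Convert the list to an array so it can be passed as varargs.
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] paths = jsonFilesIn(args.length > 0 ? args[0] : ".");
        for (String p : paths) {
            System.out.println(p);
        }
        // With Spark on the classpath, the array can then be passed directly:
        // DataFrame jsondf = sqlContext.read().json(paths);
    }
}
```

This only lists files on the local file system. For paths on HDFS or S3, the wildcard form shown in the earlier answer is likely the simpler choice.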