Java: How can I efficiently read multiple JSON files into a DataFrame or JavaRDD?
Original source: http://stackoverflow.com/questions/33710898/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How can I efficiently read multiple json files into a Dataframe or JavaRDD?
Asked by Abu Sulaiman
I can use the following code to read a single json file but I need to read multiple json files and merge them into one Dataframe. How can I do this?
DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");
Or is there a way to read multiple json files into JavaRDD then convert to Dataframe?
Answered by zero323
You can use exactly the same code to read multiple JSON files. Just pass a path to a directory, or a path with wildcards, instead of a path to a single file.
DataFrameReader also provides a json method with the following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON that has already been loaded into a JavaRDD.
Answered by tjriggs
To read multiple inputs in Spark, use wildcards. This works whether you are constructing a DataFrame or an RDD.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")
Answered by asheesh kumar singhal
The function spark.read.json accepts a list of files as a parameter.
spark.read.json(list_of_json_files)
This reads all the files in the list and returns a single DataFrame containing the data from all of them.
Answered by dmigo
The function json(String... paths) takes variable arguments, as described in the documentation.
So you can change your code like this:
sqlContext.read().json(file1, file2, ...)
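If the file names are not known in advance, the arguments for the varargs call can be gathered at run time. The following is a minimal sketch using only the Java standard library; the class JsonPathCollector and the method collectJsonPaths are hypothetical names introduced for illustration and are not part of Spark. It assumes the files sit directly in one local directory and that the Spark version in use offers json(String... paths).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper class, not part of Spark.
class JsonPathCollector {

    // Returns the paths of all regular files ending in ".json" that sit
    // directly inside dir (no recursion), sorted lexicographically.
    static String[] collectJsonPaths(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(Path::toString)
                    .filter(p -> p.endsWith(".json"))
                    .sorted()
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory is taken from the first argument, else the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String[] paths = collectJsonPaths(dir);
        for (String p : paths) {
            System.out.println(p);
        }
        // Assumption: with Spark on the classpath and an existing sqlContext,
        // a String[] can be passed where String... is expected:
        // DataFrame jsondf = sqlContext.read().json(paths);
    }
}
```

For files that all live in one directory, the wildcard form shown in the earlier answer is usually simpler; collecting paths explicitly is mainly useful when the set of files has to be filtered by some rule of your own.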