Reading a JSON file in Java with Apache Spark
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40212464/
Reading a JSON file using Apache Spark
Asked by user6325753
I am trying to read a JSON file using Spark v2.0.0. With simple data the code works really well, but with slightly more complex data, df.show() does not display the data correctly.
Here is my code:
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();
Here is my sample data:
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
And my output looks like:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "glossary": {|
| "title": ...|
| "GlossDiv": {|
| "titl...|
| "GlossList": {|
| "...|
| ...|
| "SortAs": "S...|
| "GlossTerm":...|
| "Acronym": "...|
| "Abbrev": "I...|
| "GlossDef": {|
| ...|
| "GlossSeeAl...|
| ...|
| "GlossSee": ...|
| }|
| }|
| }|
+--------------------+
only showing top 20 rows
Answered by Ramachandran.A.G
You will need to reformat the JSON onto a single line to read it. This is a multi-line JSON document, so it is not being read and loaded properly (Spark expects one object per line, one row each).
Quoting the JSON API:
Loads a JSON file (one object per line) and returns the result as a DataFrame.
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
I just tried it in the shell; it should work the same way from code as well (I got the same corrupt-record error when I read a multi-line JSON).
scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]
scala>
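The same should work from Java; a minimal sketch mirroring the question's code, assuming the one-lined JSON above was saved back to the same sample.json path:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
// With the whole document on one line, the DataFrame gets a proper nested schema
// instead of a single _corrupt_record column.
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.printSchema();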
Edit:
You can get the values out of that DataFrame using any action, for example:
scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
| GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+
scala>
You should be able to do the same from your own code as well.
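For example, a Java equivalent of the Scala select above (assuming df is the Dataset<Row> loaded from the one-line file):

df.select(df.col("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show();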
Answered by Sandeep Purohit
Just make sure your JSON is on one line. You are reading nested JSON, so if you have already done that, the JSON loaded successfully; you are just displaying it the wrong way. Because it is nested, you cannot show it directly. For example, if you want the title of GlossDiv, you can show it as follows:
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
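Given the sample data above, that select should print something like:

+-----+
|title|
+-----+
|    S|
+-----+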
Answered by Sandeep Purohit
Try:
// In Java: wholeTextFiles returns (fileName, fileContent) pairs; values() keeps just the JSON strings
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
Dataset<Row> df = session.read().json(jsc.wholeTextFiles("...").values());
Answered by skvyas
This thread is a little old; I just want to elaborate on what @user6022341 suggested. I ended up using it in one of my projects:
To process a multi-line JSON file, the wholeTextFiles(String path) transformation is the only solution in Spark if the file is one big JSON object. This transformation loads the entire file content as a single string. So, if the directory hdfs://a-hdfs-path contained two files, part-00000 and part-00001, calling sparkContext.wholeTextFiles("hdfs://a-hdfs-path") would make Spark return a JavaPairRDD whose keys are the file names and whose values are the file contents. This may not be the best solution and may hurt performance for bigger files.
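A hedged Java sketch of that approach end-to-end (the hdfs://a-hdfs-path directory is the hypothetical one from the paragraph above; json(JavaRDD<String>) is the Spark 2.x reader API):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder().master("local").appName("wholeTextFilesJson").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
// Each element is (fileName, entireFileContent); values() keeps only the JSON strings
JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs://a-hdfs-path");
Dataset<Row> df = session.read().json(files.values()); // one row per file
df.printSchema();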
But if the multi-line JSON file has multiple JSON objects split across multiple lines, then you could probably use hadoop.Configuration; some sample code is shown here. I haven't tested this out myself.
If you had to read a multi-line CSV file, you could do it with Spark 2.2:
spark.read.csv(file, multiLine=True)
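Since Spark 2.2 the same multiLine option also applies to the JSON reader (see SPARK-20980 below), so the original multi-line sample.json could be read directly; a minimal Java sketch, assuming Spark 2.2+:

Dataset<Row> df = session.read()
        .option("multiLine", true) // parse records spanning multiple lines (Spark 2.2+)
        .json("/Users/hadoop/Desktop/sample.json");
df.show();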
https://issues.apache.org/jira/browse/SPARK-19610
https://issues.apache.org/jira/browse/SPARK-20980
Hope this helps other folks looking for similar info.
Answered by RPaul
Another way to read a JSON file using Java in Spark, similar to the ones mentioned above:
SparkSession spark = SparkSession.builder().appName("ProcessJSONData")
.master("local").getOrCreate();
String path = "C:/XX/XX/myData.json";
// Encoders are created for Java bean class
Encoder<FruitJson> fruitEncoder = Encoders.bean(FruitJson.class);
Dataset<FruitJson> fruitDS = spark.read().json(path).as(fruitEncoder);
fruitDS.show();
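The FruitJson bean itself is not shown in the answer; a minimal hypothetical version for illustration (the field names are invented, and Encoders.bean requires a public class with getters and setters):

public class FruitJson implements java.io.Serializable {
    private String name;   // hypothetical field matching a "name" key in myData.json
    private Double weight; // hypothetical field matching a "weight" key
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Double getWeight() { return weight; }
    public void setWeight(Double weight) { this.weight = weight; }
}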