Reading a JSON file in Java with Apache Spark
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40212464/
Reading a JSON file using Apache Spark
Asked by user6325753
I am trying to read a JSON file using Spark v2.0.0. With simple data the code works really well, but with slightly more complex data, df.show() does not display the data correctly.
Here is my code:
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();
Here is my sample data:
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
And my output looks like:
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "glossary": {|
| "title": ...|
| "GlossDiv": {|
| "titl...|
| "GlossList": {|
| "...|
| ...|
| "SortAs": "S...|
| "GlossTerm":...|
| "Acronym": "...|
| "Abbrev": "I...|
| "GlossDef": {|
| ...|
| "GlossSeeAl...|
| ...|
| "GlossSee": ...|
| }|
| }|
| }|
+--------------------+
only showing top 20 rows
Answered by Ramachandran.A.G
You will need to reformat the JSON onto a single line to read it. This is a multi-line JSON document, so it is not being read and loaded properly (Spark expects one object per line, one row each).
Quoting the JSON API:
Loads a JSON file (one object per line) and returns the result as a DataFrame.
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
I just tried it in the shell; it should work the same way from code as well (I got the same corrupt-record error when I read a multi-line JSON).
scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]
scala>
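The same should work from Java; a minimal sketch mirroring the question's code, assuming the one-lined JSON above was saved back to the same sample.json path:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
// With the whole document on one line, the DataFrame gets a proper nested schema
// instead of a single _corrupt_record column.
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.printSchema();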
Edit:
You can get the values out of that DataFrame using any action, for example:
scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
| GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+
scala>
You should be able to do the same from your own code as well.
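For example, a Java equivalent of the Scala select above (assuming df is the Dataset<Row> loaded from the one-line file):

df.select(df.col("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show();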
Answered by Sandeep Purohit
Just make sure your JSON is on one line. You are reading nested JSON, so if you have already done that, the JSON loaded successfully; you are just displaying it the wrong way. Because it is nested, you cannot show it directly. For example, if you want the title of GlossDiv, you can show it as follows:
SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
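Given the sample data above, that select should print something like:

+-----+
|title|
+-----+
|    S|
+-----+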
Answered by Sandeep Purohit
Try:
// In Java: wholeTextFiles returns (fileName, fileContent) pairs; values() keeps just the JSON strings
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
Dataset<Row> df = session.read().json(jsc.wholeTextFiles("...").values());
Answered by skvyas
This thread is a little old; I just want to elaborate on what @user6022341 suggested. I ended up using it in one of my projects:
To process a multi-line JSON file, the wholeTextFiles(String path) transformation is the only solution in Spark if the file is one big JSON object. This transformation loads the entire file content as a single string. So, if the directory hdfs://a-hdfs-path contained two files, part-00000 and part-00001, calling sparkContext.wholeTextFiles("hdfs://a-hdfs-path") would make Spark return a JavaPairRDD whose keys are the file names and whose values are the file contents. This may not be the best solution and may hurt performance for bigger files.
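A hedged Java sketch of that approach end-to-end (the hdfs://a-hdfs-path directory is the hypothetical one from the paragraph above; json(JavaRDD<String>) is the Spark 2.x reader API):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder().master("local").appName("wholeTextFilesJson").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
// Each element is (fileName, entireFileContent); values() keeps only the JSON strings
JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs://a-hdfs-path");
Dataset<Row> df = session.read().json(files.values()); // one row per file
df.printSchema();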
But if the multi-line JSON file has multiple JSON objects split across multiple lines, then you could probably use hadoop.Configuration; some sample code is shown here. I haven't tested this out myself.
If you had to read a multi-line CSV file, you could do it with Spark 2.2:
spark.read.csv(file, multiLine=True)
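Since Spark 2.2 the same multiLine option also applies to the JSON reader (see SPARK-20980 below), so the original multi-line sample.json could be read directly; a minimal Java sketch, assuming Spark 2.2+:

Dataset<Row> df = session.read()
        .option("multiLine", true) // parse records spanning multiple lines (Spark 2.2+)
        .json("/Users/hadoop/Desktop/sample.json");
df.show();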
https://issues.apache.org/jira/browse/SPARK-19610
https://issues.apache.org/jira/browse/SPARK-20980
Hope this helps other folks looking for similar info.
Answered by RPaul
Another way to read a JSON file using Java in Spark, similar to the ones mentioned above:
SparkSession spark = SparkSession.builder().appName("ProcessJSONData")
.master("local").getOrCreate();
String path = "C:/XX/XX/myData.json";
// Encoders are created for Java bean class
Encoder<FruitJson> fruitEncoder = Encoders.bean(FruitJson.class);
Dataset<FruitJson> fruitDS = spark.read().json(path).as(fruitEncoder);
fruitDS.show();
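The FruitJson bean itself is not shown in the answer; a minimal hypothetical version for illustration (the field names are invented, and Encoders.bean requires a public class with getters and setters):

public class FruitJson implements java.io.Serializable {
    private String name;   // hypothetical field matching a "name" key in myData.json
    private Double weight; // hypothetical field matching a "weight" key
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Double getWeight() { return weight; }
    public void setWeight(Double weight) { this.weight = weight; }
}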