Create a Spark DataFrame from a nested JSON file in Scala

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/45178338/



Tags: scala, apache-spark, dataframe, nested, apache-spark-sql

Asked by devtest13

I have a json file that looks like this

{
"group" : {},
"lang" : [ 
    [ 1, "scala", "functional" ], 
    [ 2, "java","object" ], 
    [ 3, "py","interpreted" ]
]
}

I tried to create a dataframe using

val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()

when I run this I get

df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
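
A quick way to see what happened (a sketch, not part of the original question; `_corrupt_record` is Spark's default column name for unparseable input):

// each physical line of the pretty-printed file is parsed independently,
// so every line ends up in the default _corrupt_record column
df.select("_corrupt_record").show(false)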

How do we create a df based on the contents of the "lang" key? I do not care about group{}; all I need is to pull the data out of "lang" and apply a case class like this

case class ProgLang(id: Int, lang: String, `type`: String) // `type` is a Scala keyword, so it must be backtick-escaped

I have read the post Reading JSON with Apache Spark - `corrupt_record` and understand that each record needs to be on a single line, but in my case I cannot change the file structure

Answered by Ramesh Maharjan

The JSON format is wrong for Spark's default reader: the json API of sqlContext expects one JSON record per line, so it treats the multi-line file as a corrupt record. The correct form is

{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}

and supposing you have it in a file ("/home/test.json"), then you can use the following method to get the dataframe you want

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sqlContext.read.json("/home/test.json")

// explode turns each inner array of "lang" into its own row,
// then the array elements are pulled out into separate columns
val df2 = df.withColumn("lang", explode($"lang"))
    .withColumn("id", $"lang"(0))
    .withColumn("langs", $"lang"(1))
    .withColumn("type", $"lang"(2))
    .drop("lang")
    .withColumnRenamed("langs", "lang")

df2.show(false)

You should have

+---+-----+-----------+
|id |lang |type       |
+---+-----+-----------+
|1  |scala|functional |
|2  |java |object     |
|3  |py   |interpreted|
+---+-----+-----------+
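
Since the question asked to apply a case class, here is a minimal follow-up sketch (assuming the df2 built above; the elements of the JSON array are inferred as strings, so id needs a cast, and `type` must be backtick-escaped because it is a Scala keyword):

// `type` is a reserved word in Scala, so escape it with backticks
case class ProgLang(id: Int, lang: String, `type`: String)

// id was read as a string element of the array, so cast it before mapping
val ds = df2.withColumn("id", $"id".cast("int")).as[ProgLang]
ds.show(false)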

Updated

If you don't want to change your input json format as mentioned in your comment below, you can use wholeTextFiles (which reads each file as a single record) to read the json file, strip the newlines, and parse it as below

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// wholeTextFiles yields (path, content) pairs; keep the content and strip
// the newlines so the whole file becomes a single JSON line
val readJSON = sc.wholeTextFiles("/home/test.json")
  .map(x => x._2)
  .map(data => data.replaceAll("\n", ""))

val df = sqlContext.read.json(readJSON)

val df2 = df.withColumn("lang", explode($"lang"))
  .withColumn("id", $"lang"(0).cast(IntegerType))
  .withColumn("langs", $"lang"(1))
  .withColumn("type", $"lang"(2))
  .drop("lang")
  .withColumnRenamed("langs", "lang")

df2.show(false)
df2.printSchema

It should give you the dataframe as above and the schema as

root
 |-- id: integer (nullable = true)
 |-- lang: string (nullable = true)
 |-- type: string (nullable = true)

Answered by Jacek Laskowski

As of Spark 2.2 you can use the multiLine option to deal with multi-line JSON files.

scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
 |-- lang: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
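
Putting it together on Spark 2.2+, a minimal sketch (assuming a SparkSession named spark and the same file; the explode and casts follow the first answer, and `type` is backtick-escaped as before):

import org.apache.spark.sql.functions._
import spark.implicits._

case class ProgLang(id: Int, lang: String, `type`: String)

// read the multi-line file directly, produce one row per inner array of
// "lang", then pull the elements into typed columns matching the case class
val ds = spark.read.option("multiLine", true).json("jsonFile.json")
  .select(explode($"lang").as("lang"))
  .select($"lang"(0).cast("int").as("id"),
          $"lang"(1).as("lang"),
          $"lang"(2).as("type"))
  .as[ProgLang]

ds.show(false)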

Before Spark 2.2, see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.