create a spark dataframe from a nested json file in scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/45178338/
Asked by devtest13
I have a json file that looks like this
{
"group" : {},
"lang" : [
[ 1, "scala", "functional" ],
[ 2, "java","object" ],
[ 3, "py","interpreted" ]
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on the contents of the "lang" key? I do not care about group{}; all I need is to pull the data out of "lang" and apply a case class like this
case class ProgLang(id: Int, lang: String, `type`: String) // `type` is a reserved word in Scala, so it must be backtick-escaped
I have read the post Reading JSON with Apache Spark - `corrupt_record` and understand that each record needs to be on its own line, but in my case I cannot change the file structure.
Answered by Ramesh Maharjan
The json format is wrong: the json API of sqlContext is reading it as a corrupt record. The correct form is
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
and supposing you have it in a file ("/home/test.json"), you can use the following method to get the dataframe you want:
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))   // one row per inner [id, lang, type] array
  .withColumn("id", $"lang"(0))                     // element 0 -> id
  .withColumn("langs", $"lang"(1))                  // element 1 -> language name
  .withColumn("type", $"lang"(2))                   // element 2 -> paradigm
  .drop("lang")
  .withColumnRenamed("langs", "lang")
df2.show(false)                                     // show returns Unit, so keep it off the val assignment
You should have
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
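
The question also asked how to apply the ProgLang case class. A minimal sketch of one way to do that, assuming Spark 1.6+ where the Dataset API is available (and remembering that `type` must be backtick-escaped in Scala):

import sqlContext.implicits._

case class ProgLang(id: Int, lang: String, `type`: String) // `type` escaped because it is a reserved word

// df2 from above still has id as a string column, so cast it before converting to a Dataset
val ds = df2
  .withColumn("id", $"id".cast("int"))
  .as[ProgLang]
ds.show(false)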
Updated
If you don't want to change your input json format, as mentioned in your comment below, you can use wholeTextFiles to read the json file and parse it as shown below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// read the whole file as a single string and strip the newlines,
// so the multi-line json becomes one record
val readJSON = sc.wholeTextFiles("/home/test.json")
  .map(x => x._2)
  .map(data => data.replaceAll("\n", ""))

val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
  .withColumn("id", $"lang"(0).cast(IntegerType))
  .withColumn("langs", $"lang"(1))
  .withColumn("type", $"lang"(2))
  .drop("lang")
  .withColumnRenamed("langs", "lang")
df2.show(false)
df2.printSchema
It should give you the dataframe as above, and the schema as:
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)
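
As a stylistic alternative, the withColumn/drop/withColumnRenamed chain can be collapsed into a single select over the exploded array. This is just an equivalent rewrite of the transformation above, under the same assumptions:

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df2 = sqlContext.read.json(readJSON)
  .select(explode($"lang").as("l"))         // one row per inner [id, lang, type] array
  .select(
    $"l"(0).cast("int").as("id"),           // element 0 -> id
    $"l"(1).as("lang"),                     // element 1 -> language name
    $"l"(2).as("type")                      // element 2 -> paradigm
  )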
Answered by Jacek Laskowski
As of Spark 2.2 you can use the multiLine option to deal with the case of multi-line JSONs.
scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
|-- lang: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
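
From there, a minimal sketch of the full pipeline for Spark 2.2+, assuming a session named spark and the original multi-line file from the question, ending in the (backtick-escaped) ProgLang case class:

import org.apache.spark.sql.functions._
import spark.implicits._

case class ProgLang(id: Int, lang: String, `type`: String) // `type` escaped (reserved word)

val ds = spark.read.option("multiLine", true).json("jsonFile.json")
  .select(explode($"lang").as("l"))                 // one row per inner array
  .select($"l"(0).cast("int").as("id"), $"l"(1).as("lang"), $"l"(2).as("type"))
  .as[ProgLang]
ds.show(false)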
Before Spark 2.2, see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.

