create a spark dataframe from a nested json file in scala
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/45178338/
Asked by devtest13
I have a json file that looks like this
{
"group" : {},
"lang" : [
[ 1, "scala", "functional" ],
[ 2, "java","object" ],
[ 3, "py","interpreted" ]
]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on the contents of the "lang" key? I do not care about group{}; all I need is to pull the data out of "lang" and apply a case class like this
case class ProgLang(id: Int, lang: String, `type`: String) // `type` is a reserved word in Scala, so it must be backtick-escaped
I have read the post Reading JSON with Apache Spark - `corrupt_record` and understand that each record needs to be on its own line, but in my case I cannot change the file structure.
Answered by Ramesh Maharjan
The json format is wrong: the json API of sqlContext is reading it as a corrupt record. The correct form is
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
and supposing you have it in a file ("/home/test.json"), you can use the following method to get the dataframe you want:
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))   // one row per inner [id, lang, type] array
  .withColumn("id", $"lang"(0))                     // element 0 -> id
  .withColumn("langs", $"lang"(1))                  // element 1 -> language name
  .withColumn("type", $"lang"(2))                   // element 2 -> paradigm
  .drop("lang")
  .withColumnRenamed("langs", "lang")
df2.show(false)                                     // show returns Unit, so keep it off the val assignment
You should have
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
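
The question also asked how to apply the ProgLang case class. A minimal sketch of one way to do that, assuming Spark 1.6+ where the Dataset API is available (and remembering that `type` must be backtick-escaped in Scala):

import sqlContext.implicits._

case class ProgLang(id: Int, lang: String, `type`: String) // `type` escaped because it is a reserved word

// df2 from above still has id as a string column, so cast it before converting to a Dataset
val ds = df2
  .withColumn("id", $"id".cast("int"))
  .as[ProgLang]
ds.show(false)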
Updated
If you don't want to change your input json format, as mentioned in your comment below, you can use wholeTextFiles to read the json file and parse it as shown below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// read the whole file as a single string and strip the newlines,
// so the multi-line json becomes one record
val readJSON = sc.wholeTextFiles("/home/test.json")
  .map(x => x._2)
  .map(data => data.replaceAll("\n", ""))

val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
  .withColumn("id", $"lang"(0).cast(IntegerType))
  .withColumn("langs", $"lang"(1))
  .withColumn("type", $"lang"(2))
  .drop("lang")
  .withColumnRenamed("langs", "lang")
df2.show(false)
df2.printSchema
It should give you the dataframe as above, and the schema as:
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)
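
As a stylistic alternative, the withColumn/drop/withColumnRenamed chain can be collapsed into a single select over the exploded array. This is just an equivalent rewrite of the transformation above, under the same assumptions:

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df2 = sqlContext.read.json(readJSON)
  .select(explode($"lang").as("l"))         // one row per inner [id, lang, type] array
  .select(
    $"l"(0).cast("int").as("id"),           // element 0 -> id
    $"l"(1).as("lang"),                     // element 1 -> language name
    $"l"(2).as("type")                      // element 2 -> paradigm
  )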
Answered by Jacek Laskowski
As of Spark 2.2 you can use the multiLine option to deal with the case of multi-line JSONs.
scala> spark.read.option("multiLine", true).json("jsonFile.json").printSchema
root
|-- lang: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
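
From there, a minimal sketch of the full pipeline for Spark 2.2+, assuming a session named spark and the original multi-line file from the question, ending in the (backtick-escaped) ProgLang case class:

import org.apache.spark.sql.functions._
import spark.implicits._

case class ProgLang(id: Int, lang: String, `type`: String) // `type` escaped (reserved word)

val ds = spark.read.option("multiLine", true).json("jsonFile.json")
  .select(explode($"lang").as("l"))                 // one row per inner array
  .select($"l"(0).cast("int").as("id"), $"l"(1).as("lang"), $"l"(2).as("type"))
  .as[ProgLang]
ds.show(false)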
Before Spark 2.2, see How to access sub-entities in JSON file? or Read multiline JSON in Apache Spark.

