Scala - How to read a JSON file in Spark using Scala?
Original source: http://stackoverflow.com/questions/45321924/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
How to read the json file in spark using scala?
Asked by Ninja
I want to read a JSON file in the following format:
{
  "titlename": "periodic",
  "atom": [
    {
      "usage": "neutron",
      "dailydata": [
        {
          "utcacquisitiontime": "2017-03-27T22:00:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 28128,
          "intervaltime": 15
        },
        {
          "utcacquisitiontime": "2017-03-27T22:15:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 25687,
          "intervaltime": 15
        }
      ]
    }
  ]
}
I am reading it as:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me with this.
Answered by Ramesh Maharjan
I suggest using wholeTextFiles to read the file and then applying a transformation to convert it to single-line JSON format.
// wholeTextFiles yields (path, content) pairs; dropping the newlines turns
// the whole file into a single-line JSON record that read.json can parse
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json")
  .map(tuple => tuple._2.replace("\n", "").trim)
val df = sqlContext.read.json(json)
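To inspect the result and the inferred schema, a quick sketch (assuming the df from the snippet above):

df.show(false)     // show full column contents without truncation
df.printSchema()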
You should have the final valid dataframe as:
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And a valid schema as:
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
Answered by Harshad_Pardeshi
Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")
Answered by Andrei T.
It probably has something to do with the JSON object stored inside your file. Could you print it or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
val json =
  """
    |{
    |  "titlename": "periodic",
    |  "atom": [
    |    {
    |      "usage": "neutron",
    |      "dailydata": [
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:00:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 28128,
    |          "intervaltime": 15
    |        },
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:15:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 25687,
    |          "intervaltime": 15
    |        }
    |      ]
    |    }
    |  ]
    |}
  """.stripMargin
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.read
  .json(spark.sparkContext.parallelize(Seq(json)))
  .printSchema()
Answered by philantrovert
From the Apache Spark SQL docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
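As a quick sanity check, a sketch assuming the jsonDF loaded above:

jsonDF.printSchema()
jsonDF.select("titlename").show()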

