Scala - How to read a JSON file in Spark using Scala?
Original source: http://stackoverflow.com/questions/45321924/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
How to read the json file in spark using scala?
Asked by Ninja
I want to read a JSON file in the following format:
{
  "titlename": "periodic",
  "atom": [
    {
      "usage": "neutron",
      "dailydata": [
        {
          "utcacquisitiontime": "2017-03-27T22:00:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 28128,
          "intervaltime": 15
        },
        {
          "utcacquisitiontime": "2017-03-27T22:15:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 25687,
          "intervaltime": 15
        }
      ]
    }
  ]
}
I am reading it as:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me with this.
Answered by Ramesh Maharjan
I suggest using wholeTextFiles to read the file and then applying a transformation to convert it to single-line JSON format.
// wholeTextFiles yields (path, content) pairs; dropping the newlines turns
// the whole file into a single-line JSON record that read.json can parse
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json")
  .map(tuple => tuple._2.replace("\n", "").trim)
val df = sqlContext.read.json(json)
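To inspect the result and the inferred schema, a quick sketch (assuming the df from the snippet above):

df.show(false)     // show full column contents without truncation
df.printSchema()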
You should have the final valid dataframe as:
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And a valid schema as:
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
Answered by Harshad_Pardeshi
Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")
Answered by Andrei T.
It probably has something to do with the JSON object stored inside your file. Could you print it or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
val json =
  """
    |{
    |  "titlename": "periodic",
    |  "atom": [
    |    {
    |      "usage": "neutron",
    |      "dailydata": [
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:00:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 28128,
    |          "intervaltime": 15
    |        },
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:15:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 25687,
    |          "intervaltime": 15
    |        }
    |      ]
    |    }
    |  ]
    |}
  """.stripMargin
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.read
  .json(spark.sparkContext.parallelize(Seq(json)))
  .printSchema()
Answered by philantrovert
From the Apache Spark SQL docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
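As a quick sanity check, a sketch assuming the jsonDF loaded above:

jsonDF.printSchema()
jsonDF.select("titlename").show()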

