Scala: parsing JSON in Spark

Disclaimer: this content is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me) and link the original: http://stackoverflow.com/questions/41455655/


Parsing JSON in Spark

Tags: scala, apache-spark, apache-spark-sql, apache-spark-2.0

Asked by baiduXiu

I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:


import scala.io.Source
import scala.util.parsing.json.JSON

val requestJson = JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]

But when I try to use the same parser by pointing it to an HDFS file location, it does not work:


val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)

and it gives me this error:


java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)
  ... 128 elided

How can I use JSON.parseFull to read data from an HDFS file location?


Thanks


Accepted answer by mrsrinivas

Spark has built-in support for parsing JSON documents, available in the spark-sql_${scala.version} jar.


In Spark 2.0+:


import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()

// Spark infers the schema from the JSON documents.
val df = spark.read.format("json").load("json/file/location/in/hdfs")

df.show()

With the df object you can run all supported SQL operations on it, and its data processing will be distributed among the nodes, whereas requestJson would be computed on a single machine only.

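For instance, here is a hedged sketch of such operations, assuming the nested Request/currency structure from the question's JSON (the field names come from the original snippet and are not verified against the actual file):

// Assumes the documents have the shape {"Request": {"currency": ...}}.
df.printSchema()
df.select("Request.currency").show()

// Or via SQL on a temporary view:
df.createOrReplaceTempView("requests")
spark.sql("SELECT Request.currency FROM requests").show()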

Maven dependencies


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
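
If the project builds with sbt rather than Maven, the equivalent dependency would be the line below (a sketch under the same version assumptions; %% appends the Scala binary version automatically):

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"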


Edit (as per a comment, to read the file from HDFS):

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.{FileSystem, Path}

// Get a handle on the HDFS filesystem (the URI below is an example cluster address).
val hdfs = FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
val path = new Path("/user/zhc/" + x + "/") // x is a placeholder from the original snippet
val t = hdfs.listStatus(path)               // list the files in the directory
val in = hdfs.open(t(0).getPath)            // open the first file
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()

Code credit: from another SO question.
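
To tie this back to the original question: Source.fromFile only reads from the local filesystem, which is why the hdfs:// path failed. Once the stream comes from Hadoop's FileSystem API, JSON.parseFull works as before. A minimal sketch, with a placeholder path:

import scala.io.Source
import scala.util.parsing.json.JSON

// Read the whole HDFS file into a string, then parse it as in the local-drive code.
val stream = hdfs.open(new Path("/user/request.json")) // placeholder path
val jsonString = try Source.fromInputStream(stream).mkString finally stream.close()
val requestJson = JSON.parseFull(jsonString)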

Maven dependencies:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>


Answer by Madhu Kiran Seelam

It is much easier in Spark 2.0:


val df = spark.read.json("json/file/location/in/hdfs")
df.show()
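
One caveat worth noting: spark.read.json expects one JSON object per line (JSON Lines) by default. If request.json is a single pretty-printed document, as the question's parseFull usage suggests, Spark 2.2+ can read it with the multiLine option, sketched below:

// Requires Spark 2.2+; without this option a pretty-printed file
// typically ends up in a single _corrupt_record column.
val df = spark.read.option("multiLine", "true").json("json/file/location/in/hdfs")
df.show()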

Answer by Rahul Modak

One can use the following in Spark to read the file from HDFS:

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")

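From there the original JSON.parseFull code works unchanged on the collected string, e.g. this sketch reusing the field names from the question:

import scala.util.parsing.json.JSON

// Parse the collected text exactly as in the original local-drive version.
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]

Note that collect brings the whole file to the driver, which is fine for a small request file but gives up Spark's distributed processing for large data.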