Parsing JSON in Spark (Scala)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41455655/
Asked by baiduXiu
I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
import scala.io.Source
import scala.util.parsing.json.JSON

val requestJson = JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser by pointing it at an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)
  ... 128 elided
How can I use JSON.parseFull to read data from an HDFS file location?
Thanks
Accepted answer by mrsrinivas
Spark has built-in support for parsing JSON documents; it is available in the spark-sql_${scala.version} jar.
In Spark 2.0+:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
// Read the JSON file from HDFS into a DataFrame
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
With the df object you can perform all supported SQL operations on it, and its data processing will be distributed among the nodes, whereas requestJson will be computed on a single machine only.
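For example, the currency field from the question could be pulled out with the DataFrame API instead of manual Map casts. A minimal sketch, assuming the JSON has the { "Request": { "currency": ... } } shape shown in the question:

// Assumes the { "Request": { "currency": ... } } shape from the question.
// Nested fields are addressed with dot notation in Spark SQL.
val currencyDf = df.select("Request.currency")
currencyDf.show()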
Maven dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
Edit: (as per a comment, to read the file from HDFS)

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

val hdfs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
val path = new Path("/user/zhc/" + x + "/") // x: a subdirectory name, as in the original snippet
val t = hdfs.listStatus(path)
val in = hdfs.open(t(0).getPath)
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()

Code credits: from another SO question

Maven dependencies:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
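Note that readLine above returns only the first line of the file; to parse a whole JSON document you need the full content. A minimal sketch, reusing the hdfs handle and the t file listing from the snippet above:

// Drain the whole file into a single string rather than reading one line.
val stream = hdfs.open(t(0).getPath)
val jsonString = try scala.io.Source.fromInputStream(stream).mkString finally stream.close()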
Answer by Madhu Kiran Seelam
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
Answer by Rahul Modak
One can use the following in Spark to read the file from HDFS:

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
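The resulting string can then be fed to the parser from the question. A minimal sketch, reusing the question's original Map traversal:

import scala.util.parsing.json.JSON

// jsonText comes from the sc.textFile call above; the casts mirror the question's code.
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]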

