Parsing JSON in Spark (Scala)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/41455655/
Asked by baiduXiu
I was using the Scala JSON library to parse a JSON file from a local drive in a Spark job:
import scala.io.Source
import scala.util.parsing.json.JSON

val requestJson = JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]
But when I try to use the same parser by pointing it at an HDFS file location, it doesn't work:
val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)
and gives me an error:
java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)
  ... 128 elided
How can I use JSON.parseFull to read data from an HDFS file location?
Thanks
Accepted answer by mrsrinivas
Spark has built-in support for parsing JSON documents; it is available in the spark-sql_${scala.version} jar.
In Spark 2.0+:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
// Read the JSON file from HDFS into a DataFrame
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
With the df object you can perform all supported SQL operations on it, and its data processing will be distributed among the nodes, whereas requestJson will be computed on a single machine only.
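For example, the currency field from the question could be pulled out with the DataFrame API instead of manual Map casts. A minimal sketch, assuming the JSON has the { "Request": { "currency": ... } } shape shown in the question:

// Assumes the { "Request": { "currency": ... } } shape from the question.
// Nested fields are addressed with dot notation in Spark SQL.
val currencyDf = df.select("Request.currency")
currencyDf.show()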
Maven dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
Edit: (as per a comment, to read the file from HDFS)

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

val hdfs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
  new org.apache.hadoop.conf.Configuration()
)
val path = new Path("/user/zhc/" + x + "/") // x: a subdirectory name, as in the original snippet
val t = hdfs.listStatus(path)
val in = hdfs.open(t(0).getPath)
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()

Code credits: from another SO question

Maven dependencies:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
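Note that readLine above returns only the first line of the file; to parse a whole JSON document you need the full content. A minimal sketch, reusing the hdfs handle and the t file listing from the snippet above:

// Drain the whole file into a single string rather than reading one line.
val stream = hdfs.open(t(0).getPath)
val jsonString = try scala.io.Source.fromInputStream(stream).mkString finally stream.close()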
Answer by Madhu Kiran Seelam
It is much easier in Spark 2.0:
val df = spark.read.json("json/file/location/in/hdfs")
df.show()
Answer by Rahul Modak
One can use the following in Spark to read the file from HDFS:

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
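The resulting string can then be fed to the parser from the question. A minimal sketch, reusing the question's original Map traversal:

import scala.util.parsing.json.JSON

// jsonText comes from the sc.textFile call above; the casts mirror the question's code.
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]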

