Reading JSON with Apache Spark - `corrupt_record`

Disclaimer: this page is a translated copy of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/38895057/

Tags: json, scala, apache-spark

Asked by LearningSlowly

I have a json file, nodes, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]

I am able to read and manipulate this record with Python.

I am trying to read this file in scala through the spark-shell.

From this tutorial, I can see that it is possible to read json via sqlContext.read.json:

val vfile = sqlContext.read.json("path/to/file/nodes.json")

However, this results in a corrupt_record error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt but sound json.

Answered by dk14

Spark cannot read a top-level JSON array into records, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so that it can distribute their processing (roughly speaking, per entity).

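For example, once the records are stored one per line (written here to a hypothetical nodes.jsonl path for illustration), the original read call from the question works as-is. A minimal sketch, not part of the original answer:

val vfile = sqlContext.read.json("path/to/file/nodes.jsonl")
// the schema should now list index, point and toid instead of _corrupt_record
vfile.printSchema()
vfile.show()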

To shed more light on it, here is a quote from the official doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSONL. Basically it's an alternative to CSV.

Answered by SandeepGodara

Since Spark expects the "JSON Lines" format rather than a typical JSON file, we can tell Spark to read typical JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>")
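
Note that the multiline option was added in Spark 2.2, so it is not available on older versions. As a quick sanity check against the file from the question (a small sketch, not part of the original answer; the path is the one used by the asker):

val df = spark.read
  .option("multiline", "true")
  .json("path/to/file/nodes.json")
df.printSchema()  // should now show index, point and toid rather than _corrupt_record
df.show(false)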

Answered by Datageek

To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// wholeTextFiles yields (path, content) pairs; .values keeps each file as one string
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.

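On Spark 2.2 and later, where the RDD overload of read.json is deprecated, the same idea can be expressed with a Dataset[String] instead (a sketch, not part of the original answer):

import spark.implicits._

val jsonDS = spark.sparkContext.wholeTextFiles("file.json").values.toDS()
val df = spark.read.json(jsonDS)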

Answered by Robert Gabriel

I ran into the same problem. I used a SparkContext and a SparkSession with the same configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()

Then, using the Spark context, I read the whole json file (JSON is the path to the file):

// wholeTextFiles returns (path, fileContent) pairs; keep only the content
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)

You can create a schema for later selects, filters, and so on:

import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))

Create a DataFrame using spark sql:

import org.apache.spark.sql.DataFrame
val df: DataFrame = spark.read.schema(schema).json(jsonRDD)

For testing use show and printSchema:

df.show()
df.printSchema()
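
Once the DataFrame is built, the explicit schema also makes typed queries straightforward. A small sketch (not part of the original answer; the easting/northing aliases are just illustrative names for the two point coordinates):

import spark.implicits._

// filter on the numeric index and pull individual coordinates out of the point array
df.filter($"index" > 2)
  .select($"toid", $"point"(0).as("easting"), $"point"(1).as("northing"))
  .show(false)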

sbt build file:

name := "spark-single"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"