_corrupt_record error when reading a JSON file into Spark with Python

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/35409539/


_corrupt_record error when reading a JSON file into Spark

python json dataframe pyspark

Asked by mar tin

I've got this JSON file


{
    "a": 1, 
    "b": 2
}

which was produced with Python's json.dump method. Now I want to read this file into a Spark DataFrame using pyspark. Following the documentation, I'm doing this:


from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()

sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json')

print df.show()


The print statement, however, spits out this:


+---------------+
|_corrupt_record|
+---------------+
|              {|
|       "a": 1, |
|         "b": 2|
|              }|
+---------------+

Does anyone know what's going on and why the file is not being interpreted correctly?


Accepted answer by Bernhard

You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json


If your JSON file looks like this, it will give you the expected dataframe:


{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }

....
df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

Answer by George Fisher

Adding to @Bernhard's great answer


import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
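
The flattened file can then be read with the default line-delimited reader; a minimal usage sketch, assuming the same sqlc set up in the question and the file names above:

# flattened.json now has one JSON object per line,
# so the default line-delimited reader parses it correctly
df = sqlc.read.json('flattened.json')
df.show()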

Answer by wiggy

If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:


sc = SparkContext() 
sqlc = SQLContext(sc)

df = sqlc.read.json('my_file.json', multiLine=True)

print df.show()

Answer by Murtaza Zaveri

In Spark 2.2+ you can read a multiline JSON file using the following command (in Scala):


val dataframe = spark.read.option("multiline", true).json("filePath")

If there is one JSON object per line, then:


val dataframe = spark.read.json(filepath)
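
For pyspark, which the question uses, a minimal equivalent sketch, assuming an existing SparkSession named spark:

# pyspark equivalent of the Scala snippet above;
# assumes an existing SparkSession named `spark`
dataframe = spark.read.option("multiline", True).json("filePath")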

Answer by Vzzarr

I want to share my experience: I had a JSON String column, but written with Python notation, which means the column contained None instead of null, False instead of false, and True instead of true.

When parsing this column, Spark returned a column named _corrupt_record, so before parsing the JSON string I had to replace the Python notation with standard JSON notation:

from pyspark.sql import functions as F

df = df.withColumn("json_notation",
    F.regexp_replace(F.regexp_replace(F.regexp_replace("_corrupt_record", "None", "null"), "False", "false"), "True", "true"))

After this transformation I was able to use, for example, the function F.from_json() on the json_notation column, and Pyspark could then correctly parse the JSON object.
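
A minimal sketch of that final parsing step; the schema and field names here are illustrative assumptions, not from the original answer:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, BooleanType

# hypothetical schema matching a record like {"a": 1, "flag": true}
schema = StructType([
    StructField("a", LongType()),
    StructField("flag", BooleanType()),
])

# parse the cleaned JSON string into a struct column
parsed = df.withColumn("parsed", F.from_json("json_notation", schema))
parsed.select("parsed.a", "parsed.flag").show()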