_corrupt_record error when reading a JSON file into Spark
Note: this page mirrors a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow CC BY-SA, cite the original address, and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/35409539/
Asked by mar tin
I've got this JSON file
{
"a": 1,
"b": 2
}
which was produced with Python's json.dump method. Now I want to read this file into a Spark DataFrame using pyspark. Following the documentation, I'm doing this:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement, though, spits out this:
+---------------+
|_corrupt_record|
+---------------+
| {|
| "a": 1, |
| "b": 2|
| }|
+---------------+
Does anyone know what's going on and why it is not interpreting the file correctly?
Accepted answer by Bernhard
You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....
df.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
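As a side note, such a file can be produced directly from Python by writing one json.dumps result per line, instead of calling json.dump on the whole pretty-printed structure. A minimal sketch (the file name records.json and the sample records are illustrative assumptions):

import json

# hypothetical sample records; one JSON object is written per line
records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

with open("records.json", "w") as outfile:
    for record in records:
        # json.dumps keeps each object on a single line
        outfile.write(json.dumps(record) + "\n")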
Answer by George Fisher
Adding to @Bernhard's great answer:
import json

# the original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one JSON object per line
# ('w' mode overwrites, so reruns don't append duplicate records)
with open("flattened.json", 'w') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
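To verify, the flattened file can then be read back the same way as in the question (a sketch assuming the SparkContext/SQLContext setup shown there):

df = sqlc.read.json('flattened.json')
df.show()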
Answer by wiggy
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
df.show()
Answer by Murtaza Zaveri
In Spark 2.2+ you can read a multiline JSON file using the following command:
val dataframe = spark.read.option("multiline", true).json("filePath")
If there is one JSON object per line, then:
val dataframe = spark.read.json(filepath)
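Since the question uses pyspark, the equivalent in PySpark would be something like the following sketch (spark is assumed to be a SparkSession, and filePath is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiline JSON: one object spread over several lines
df_multiline = spark.read.option("multiline", True).json("filePath")

# default: one JSON object per line
df_lines = spark.read.json("filePath")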
Answer by Vzzarr
I want to share my experience, in which I had a JSON String column, but in Python notation: None instead of null, False instead of false, and True instead of true.
When parsing this column, Spark returned me a column named _corrupt_record. So, before parsing the JSON string, I had to replace the Python notation with standard JSON notation:
from pyspark.sql import functions as F

# replace Python literals (None/False/True) with their JSON equivalents
df = df.withColumn("json_notation",
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace("_corrupt_record", "None", "null"),
            "False", "false"),
        "True", "true"))
After this transformation I was then able to use, for example, the function F.from_json() on the json_notation column, and there PySpark was able to correctly parse the JSON object.
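For illustration, a minimal sketch of that final step (the schema with fields a and b and the parsed column name are assumptions, not part of the original answer):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

# hypothetical schema matching records like {"a": 1, "b": 2}
schema = StructType([
    StructField("a", LongType()),
    StructField("b", LongType()),
])

parsed = df.withColumn("parsed", F.from_json("json_notation", schema))
parsed.select("parsed.a", "parsed.b").show()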