Original question: http://stackoverflow.com/questions/43232169/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Converting a dataframe into JSON (in pyspark) and then selecting desired fields
Asked by xn139
I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask App:
results = result.toJSON().collect()
An example entry in my json file is below. I then tried to run a for loop in order to get specific results:
{"userId":"1","systemId":"30","title":"interest"}
for i in results:
    print i["userId"]
This doesn't work at all and I get errors such as: Python (json) : TypeError: expected string or buffer
I used json.dumps and json.loads and still got nothing: I keep getting errors such as "string indices must be integers", as well as the error above.
I then tried this:
print i[0]
This gave me the first character in the json "{" instead of the first line. I don't really know what to do, can anyone tell me where I'm going wrong?
Many Thanks.
Answered by Allie Fitter
If the result of result.toJSON().collect() is a JSON encoded string, then you would use json.loads() to convert it to a dict. The issue you're running into is that when you iterate a dict with a for loop, you're given the keys of the dict. In your for loop, you're treating the key as if it's a dict, when in fact it is just a string. Try this:
import json

# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())
for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result
# of toJSON()
def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))

# Note: foreach runs on the executors, so on a cluster the output
# goes to the executor logs rather than the driver console.
results = result.toJSON()
results.foreach(print_rows)
EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.
collect(): Return a list that contains all of the elements in this RDD.
Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
EDIT2: I can't emphasize this enough: always read the docs.
EDIT3: Look here.
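To make the point of the first EDIT concrete, here is a minimal sketch that needs no Spark at all. It assumes the value returned by result.toJSON().collect() is a plain list of JSON strings; the sample rows are hypothetical and modelled on the entry shown in the question.

```python
import json

# Stand-in for result.toJSON().collect(): a plain list of JSON strings.
# (Hypothetical sample data modelled on the entry in the question.)
rows = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"sport"}',
]

first = rows[0]
print(type(first).__name__)  # each element is a str, not a dict
print(first[0])              # indexing a str yields one character: {

# Decode each string before looking fields up by name.
user_ids = [json.loads(row)["userId"] for row in rows]
print(user_ids)
```

This is why i["userId"] raised "string indices must be integers" and why i[0] printed "{": inside the loop, i was still an undecoded string.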
Answered by Zeitgeist
Here is what worked for me:
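import json

df_json = df.toJSON()
for row in df_json.collect():
    # each row is a JSON string
    print(row)
    # parse the JSON string into a dict
    line = json.loads(row)
    # some_key is a placeholder for one of your column names
    print(line[some_key])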
Keep in mind that using .collect() is not advisable on large data, since it pulls the entire distributed DataFrame into the driver's memory and defeats the purpose of using a distributed DataFrame.
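One way to act on that caveat is to decode rows lazily instead of building the full list first. The sketch below is an illustration, not part of the original answer: iter_fields is a hypothetical helper, and the sample list stands in for any iterable of JSON strings. With Spark, one could pass df.toJSON().toLocalIterator(), which brings rows to the driver one partition at a time, or df.toJSON().take(n) for a bounded sample.

```python
import json

def iter_fields(json_rows, fields):
    """Yield one dict per JSON string, keeping only the requested fields.

    Hypothetical helper: a field missing from a row comes back as None.
    """
    for raw in json_rows:
        record = json.loads(raw)
        yield {name: record.get(name) for name in fields}

# Plain list used as a stand-in so the sketch runs without Spark.
# Assumption: with Spark, pass df.toJSON().toLocalIterator() instead.
sample = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31"}',
]

for item in iter_fields(sample, ["userId", "title"]):
    print(item)
```

Because iter_fields is a generator, only one decoded row is held at a time, whatever the source iterable is.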
Answered by Bala
>>> import json
>>> df = sqlContext.read.table("n1")
>>> df.show()
+-----+-------+----+---------------+-------+----+
|   c1|     c2|  c3|             c4|     c5|  c6|
+-----+-------+----+---------------+-------+----+
|00001|Content|   1|Content-article|       |2018|
|00002|Content|null|Content-article|Content|2015|
+-----+-------+----+---------------+-------+----+
>>> results = df.toJSON().map(lambda j: json.loads(j)).collect()
>>> for i in results: print(i["c1"], i["c6"])
...
00001 2018
00002 2015
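If only a few columns are needed, a possible variation on the session above is to select them in Spark before converting to JSON, so less data reaches the driver. The helper below is a hedged sketch: fetch_selected and its max_rows parameter are hypothetical names, and the function only assumes that df provides the DataFrame methods select, limit and toJSON.

```python
import json

def fetch_selected(df, columns, max_rows=100):
    """Return up to max_rows rows as dicts, keeping only `columns`.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame.
    The column selection and row limit happen in Spark; only the
    trimmed rows are collected and decoded on the driver.
    """
    json_rows = df.select(*columns).limit(max_rows).toJSON().collect()
    return [json.loads(row) for row in json_rows]
```

With the table from the session above, fetch_selected(df, ["c1", "c6"]) would give a list of dicts that can be indexed by column name, as in the loop over results.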