Python: Converting a dataframe into JSON (in pyspark) and then selecting desired fields

Question

Asked by xn139

I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask App:

results = result.toJSON().collect()

An example entry in my json file is below. I then tried to run a for loop in order to get specific results:

{"userId":"1","systemId":"30","title":"interest"}

for i in results:
    print i["userId"]

This doesn't work at all and I get errors such as: Python (json) : TypeError: expected string or buffer

I used json.dumps and json.loads and still nothing: I keep getting errors such as "string indices must be integers", as well as the above error.

I then tried this:

  print i[0]

This gave me the first character in the json "{" instead of the first line. I don't really know what to do, can anyone tell me where I'm going wrong?

Many Thanks.

Answer 1

Answered by Allie Fitter

If the result of result.toJSON().collect() is a JSON encoded string, then you would use json.loads() to convert it to a dict. The issue you're running into is that when you iterate a dict with a for loop, you're given the keys of the dict. In your for loop, you're treating the key as if it's a dict, when in fact it is just a string. Try this:

import json

# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())

for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result
# of toJSON()

def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))


# foreach() runs print_rows on the executors, so on a cluster the
# output appears in the executor logs rather than the driver console.
results = result.toJSON()
results.foreach(print_rows)

EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.

collect() Return a list that contains all of the elements in this RDD.
Note This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
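
To make the EDIT above concrete, here is a minimal sketch in plain Python. The list `results` is a hand-written stand-in for what `result.toJSON().collect()` returns, based on the example entry in the question; the second entry is invented for illustration. No Spark session is needed to see the behaviour.

```python
import json

# Hand-written stand-in for result.toJSON().collect():
# a plain Python list whose elements are JSON-encoded strings.
# The second entry is hypothetical.
results = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"hobby"}',
]

# Each element is a str, so indexing it returns a character, not a field.
first_char = results[0][0]

# Passing the whole list to json.loads() fails: it expects a single string.
try:
    json.loads(results)
    list_error = None
except TypeError as exc:
    list_error = exc

# Decoding each element separately gives one dict per row.
rows = [json.loads(s) for s in results]
user_ids = [row["userId"] for row in rows]
print(user_ids)
```

This is why `i[0]` in the question printed "{": the loop variable was still a JSON string, and it only becomes a dict after `json.loads()` is applied to that single element.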

EDIT2: I can't emphasize enough, always read the docs.

EDIT3: Look here.

Answer 2

Answered by Zeitgeist

Here is what worked for me:

import json

df_json = df.toJSON()

for row in df_json.collect():
    # JSON string
    print(row)

    # decoded into a dict
    line = json.loads(row)
    # some_key is a placeholder for a field name, e.g. "userId"
    print(line[some_key])

Keep in mind that using .collect() is not advisable on large results, since it pulls all of the distributed data into the driver's memory and defeats the purpose of using distributed data frames.
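
One way to limit what reaches the driver at once, sketched under assumptions: the helper `pick_fields` below is hypothetical (it is not part of PySpark) and works on any iterable of JSON strings. It is shown here with a small hand-written list; with Spark it could instead be fed `df.toJSON().toLocalIterator()`, which hands rows to the driver partition by partition rather than building one large list.

```python
import json

def pick_fields(json_rows, fields):
    """Yield one dict per JSON string, keeping only the requested fields."""
    for raw in json_rows:
        record = json.loads(raw)
        # Fields missing from a row are skipped rather than raising KeyError.
        yield {name: record[name] for name in fields if name in record}

# Hand-written stand-in for df.toJSON().toLocalIterator()
sample_rows = [
    '{"userId":"1","systemId":"30","title":"interest"}',
]

selected = list(pick_fields(sample_rows, ["userId", "title"]))
print(selected)
```

Because `pick_fields` is a generator, rows are decoded one at a time as they are consumed, which may suit passing a handful of fields to a Flask view.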

Answer 3

Answered by Bala

>>> import json
>>> df = sqlContext.read.table("n1")
>>> df.show()
+-----+-------+----+---------------+-------+----+
|   c1|     c2|  c3|             c4|     c5|  c6|
+-----+-------+----+---------------+-------+----+
|00001|Content|   1|Content-article|       |2018|
|00002|Content|null|Content-article|Content|2015|
+-----+-------+----+---------------+-------+----+

>>> results = df.toJSON().map(lambda j: json.loads(j)).collect()
>>> for i in results: print(i["c1"], i["c6"])
... 
00001 2018
00002 2015
