Convert a standard python key value dictionary list to pyspark data frame
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37584077/
Asked by stackit
Suppose I have a list of Python dictionaries (key-value pairs), where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark dataframe with the two columns arg1 and arg2?
[{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]
How can I use the following construct to do it?
df = sc.parallelize([
...
]).toDF()
Where do I place arg1 and arg2 in the above code (the ...)?
Answered by 652bb3ca
Old way:
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has its fields in the same order,
    # which lets toDF() infer a single consistent schema.
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
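For a quick sanity check, here is a minimal sketch of what the new way produces (not from the original answer; it assumes an active SparkContext named sc, and the non-empty sample values are placeholders):

from collections import OrderedDict
from pyspark.sql import Row

data = [{"arg1": "1", "arg2": "a"}, {"arg1": "2", "arg2": "b"}]
df = sc.parallelize(data) \
    .map(lambda d: Row(**OrderedDict(sorted(d.items())))) \
    .toDF()
df.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |   1|   a|
# |   2|   b|
# +----+----+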
Answered by Jeston
I had to modify the accepted answer in order for it to work for me in Python 2.7 running Spark 2.0.
from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession
    .builder
    .getOrCreate()
)

# Define the schema explicitly instead of relying on inference
schema = StructType([
    StructField('arg1', StringType(), True),
    StructField('arg2', StringType(), True)
])

dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

# Sort the keys so the Row field order matches the schema
dtaRDD = spark.sparkContext.parallelize(dta) \
    .map(lambda x: Row(**OrderedDict(sorted(x.items()))))

dtaDF = spark.createDataFrame(dtaRDD, schema)
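Once it runs, you can verify that the explicit schema was applied (a sketch; the commented lines below show what printSchema() typically emits for this schema):

dtaDF.printSchema()
# root
#  |-- arg1: string (nullable = true)
#  |-- arg2: string (nullable = true)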
Answered by gepant
For anyone looking for a solution to something slightly different, I found this worked for me: I have a single dictionary of key-value pairs, and I wanted to convert it to two PySpark dataframe columns:
So
{k1:v1, k2:v2 ...}
Becomes
---------------
| col1 | col2 |
|------|------|
|  k1  |  v1  |
|  k2  |  v2  |
---------------
# dict.items() yields (key, value) tuples; map(list, ...) turns each into [key, value]
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
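A minimal sketch of that in action (not from the original answer; mydict is a placeholder and spark is assumed to be an active SparkSession):

mydict = {"k1": "v1", "k2": "v2"}
lol = list(map(list, mydict.items()))   # [['k1', 'v1'], ['k2', 'v2']]
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+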