Convert a standard Python key-value dictionary list to a PySpark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37584077/

Tags: python, dictionary, apache-spark, pyspark

Asked by stackit

Suppose I have a list of Python dictionaries in which each key corresponds to a column name of a table. For the list below, how can I convert it into a PySpark DataFrame with the two columns arg1 and arg2?

 [{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]

How can I use the following construct to do it?

df = sc.parallelize([
    ...
]).toDF()

Where should arg1 and arg2 go in the above code (the ... part)?

Answered by 652bb3ca

Old way:

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
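On Spark 2.x this emits a deprecation warning, since inferring the schema from plain dicts is deprecated. As a side note that is not part of the original answer, recent Spark versions also accept the list of dicts directly in createDataFrame; a minimal sketch, assuming an active SparkSession named spark:

data = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

# Column names are taken from the dictionary keys.
df = spark.createDataFrame(data)
df.show()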

New way:

from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
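The sorted(d.items()) call is what keeps the column order stable: before Python 3.7, dictionary iteration order was not guaranteed, so sorting the keys ensures every Row has the same field order. As an illustration (not from the original answer), calling convert_to_row on a made-up dict:

sample = {"arg2": "y", "arg1": "x"}   # illustrative values
print(convert_to_row(sample))         # Row(arg1='x', arg2='y')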

Answered by Jeston

I had to modify the accepted answer to get it to work for me on Python 2.7 with Spark 2.0.

from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession
        .builder
        .getOrCreate()
    )

schema = StructType([
    StructField('arg1', StringType(), True),
    StructField('arg2', StringType(), True)
])

dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

dtaRDD = spark.sparkContext.parallelize(dta) \
    .map(lambda x: Row(**OrderedDict(sorted(x.items()))))

dtaDF = spark.createDataFrame(dtaRDD, schema)
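As a quick sanity check, a small sketch reusing the objects above (not part of the original answer):

dtaDF.printSchema()   # arg1 and arg2, both nullable strings
dtaDF.show()          # two rows of empty strings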

Answered by gepant

In case anyone is looking for the solution to a slightly different problem: I had a single dictionary of key-value pairs and wanted to convert it into two PySpark DataFrame columns, one for the keys and one for the values:

So

{k1:v1, k2:v2 ...}

Becomes

+------+------+
| col1 | col2 |
+------+------+
| k1   | v1   |
| k2   | v2   |
+------+------+

# Turn the dict into a list of [key, value] lists, then build the DataFrame.
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
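A hedged end-to-end sketch with an illustrative dictionary (mydict is made up here; assumes an active SparkSession named spark):

mydict = {"k1": "v1", "k2": "v2"}      # illustrative input
lol = list(map(list, mydict.items()))  # [["k1", "v1"], ["k2", "v2"]]
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+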