Convert a standard Python key-value dictionary list to a PySpark DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37584077/

Tags: python, dictionary, apache-spark, pyspark

Asked by stackit

Suppose I have a list of Python dictionaries in which each key corresponds to a column name of a table. For the list below, how can I convert it into a PySpark DataFrame with the two columns arg1 and arg2?

 [{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]

How can I use the following construct to do it?

df = sc.parallelize([
    ...
]).toDF()

Where should arg1 and arg2 go in the above code (the ... part)?

Answered by 652bb3ca

Old way:

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
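On Spark 2.x this emits a deprecation warning, since inferring the schema from plain dicts is deprecated. As a side note that is not part of the original answer, recent Spark versions also accept the list of dicts directly in createDataFrame; a minimal sketch, assuming an active SparkSession named spark:

data = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

# Column names are taken from the dictionary keys.
df = spark.createDataFrame(data)
df.show()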

New way:

from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
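The sorted(d.items()) call is what keeps the column order stable: before Python 3.7, dictionary iteration order was not guaranteed, so sorting the keys ensures every Row has the same field order. As an illustration (not from the original answer), calling convert_to_row on a made-up dict:

sample = {"arg2": "y", "arg1": "x"}   # illustrative values
print(convert_to_row(sample))         # Row(arg1='x', arg2='y')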

Answered by Jeston

I had to modify the accepted answer to get it to work for me on Python 2.7 with Spark 2.0.

from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession
        .builder
        .getOrCreate()
    )

schema = StructType([
    StructField('arg1', StringType(), True),
    StructField('arg2', StringType(), True)
])

dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]

dtaRDD = spark.sparkContext.parallelize(dta) \
    .map(lambda x: Row(**OrderedDict(sorted(x.items()))))

dtaDF = spark.createDataFrame(dtaRDD, schema)
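As a quick sanity check, a small sketch reusing the objects above (not part of the original answer):

dtaDF.printSchema()   # arg1 and arg2, both nullable strings
dtaDF.show()          # two rows of empty strings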

Answered by gepant

In case anyone is looking for the solution to a slightly different problem: I had a single dictionary of key-value pairs and wanted to convert it into two PySpark DataFrame columns, one for the keys and one for the values:

So

{k1:v1, k2:v2 ...}

Becomes

+------+------+
| col1 | col2 |
+------+------+
| k1   | v1   |
| k2   | v2   |
+------+------+

# Turn the dict into a list of [key, value] lists, then build the DataFrame.
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
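A hedged end-to-end sketch with an illustrative dictionary (mydict is made up here; assumes an active SparkSession named spark):

mydict = {"k1": "v1", "k2": "v2"}      # illustrative input
lol = list(map(list, mydict.items()))  # [["k1", "v1"], ["k2", "v2"]]
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+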