Python PySpark: create a new column with a mapping from a dict
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42980704/
PySpark create new column with mapping from a dict
Asked by ad_s
Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H, and I want to create a new column (say col2) with the values from the dict below. How do I map this? (So, for instance, 'A' needs to be mapped to 'S', etc.)
dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
Answered by zero323
Inefficient solution with UDF (version independent):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate(mapping):
    # Look up each value of the column in the given dict; missing keys yield None
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())

df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])

mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key"))
with the result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
Much more efficient (Spark >= 2.0, Spark < 3.0) is to create a MapType literal:
from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# Flatten the dict into alternating key/value literals: key1, val1, key2, val2, ...
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr.getItem(col("key")))
with the same result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
but a more efficient execution plan:
== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]
compared to the UDF version:
== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
+- Scan ExistingRDD[key#15]
In Spark >= 3.0, getItem should be replaced with __getitem__ ([]), i.e.:
df.withColumn("value", mapping_expr[col("key")]).show()
Answered by Haim Bendanan
Sounds like the simplest solution would be to use the replace function: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
mapping = {
    'A': '1',
    'B': '2'
}

df2 = df.replace(to_replace=mapping, subset=['yourColName'])
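Since replace rewrites values inside the listed columns rather than adding a new one, one way to get the separate col2 the question asks for is to copy col1 first and then run replace on the copy. A minimal sketch, assuming the question's column names and mapping:

from pyspark.sql.functions import col

mapping = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
           'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

# Duplicate col1 into col2, then map the values of col2 in place
df2 = df.withColumn('col2', col('col1')).replace(to_replace=mapping, subset=['col2'])

Unlike the map lookup above, values that do not appear in the dict are simply left unchanged rather than becoming null.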