Divide Pyspark Dataframe Column by Column in other Pyspark Dataframe when ID Matches
Note: this is a translation of a StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/43287451/
Asked by TrentWoodbury
I have a PySpark DataFrame, df1, that looks like:
CustomerID CustomerValue
12 .17
14 .15
14 .25
17 .50
17 .01
17 .35
I have a second PySpark DataFrame, df2, that is df1 grouped by CustomerID and aggregated by the sum function. It looks like this:
CustomerID CustomerValueSum
12 .17
14 .40
17 .86
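For reference (not part of the original question), df2 as described could be produced with a minimal groupBy/agg sketch, assuming df1 is already defined:
import pyspark.sql.functions as F

# Aggregate df1 per CustomerID to get each customer's total CustomerValue
df2 = df1.groupBy("CustomerID").agg(F.sum("CustomerValue").alias("CustomerValueSum"))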
I want to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the same CustomerID (for CustomerID 14, for example, .15 / .40 = .375 and .25 / .40 = .625, shown rounded in the table below). This would look like:
CustomerID CustomerValue NormalizedCustomerValue
12 .17 1.00
14 .15 .38
14 .25 .62
17 .50 .58
17 .01 .01
17 .35 .41
In other words, I'm trying to convert this Python/Pandas code to PySpark:
normalized_list = []
for idx, row in df1.iterrows():
    normalized_list.append(
        row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]
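For reference (not part of the original question), a vectorized Pandas equivalent of this loop would be a grouped transform; the sketch below assumes df1 holds all of the rows shown above:
# Sketch: compute each customer's total in place and divide by it
df1['NormalizedCustomerValue'] = (
    df1['CustomerValue'] / df1.groupby('CustomerID')['CustomerValue'].transform('sum')
)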
How can I do this?
Answered by dfernig
Code:
import pyspark.sql.functions as F
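# Join the per-customer sums from df2 onto df1 by CustomerID, compute the ratio,
# then drop the helper column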
df1 = df1\
.join(df2, "CustomerID")\
.withColumn("NormalizedCustomerValue", (F.col("CustomerValue") / F.col("CustomerValueSum")))\
.drop("CustomerValueSum")
Output:
df1.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
| 17| 0.5| 0.5813953488372093|
| 17| 0.01| 0.011627906976744186|
| 17| 0.35| 0.4069767441860465|
| 12| 0.17| 1.0|
| 14| 0.15| 0.37499999999999994|
| 14| 0.25| 0.625|
+----------+-------------+-----------------------+
Answered by Abhishek Bansal
This can also be achieved with a Spark Window function, so you do not need to create a separate dataframe with the aggregated values (df2):
Creating the data for the input dataframe:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
data =[(12, 0.17), (14, 0.15), (14, 0.25), (17, 0.5), (17, 0.01), (17, 0.35)]
df1 = sqlContext.createDataFrame(data, ['CustomerID', 'CustomerValue'])
df1.show()
+----------+-------------+
|CustomerID|CustomerValue|
+----------+-------------+
| 12| 0.17|
| 14| 0.15|
| 14| 0.25|
| 17| 0.5|
| 17| 0.01|
| 17| 0.35|
+----------+-------------+
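Note (not part of the original answer): HiveContext is a legacy entry point; on Spark 2.x and later the same DataFrame can be created through a SparkSession, for example:
from pyspark.sql import SparkSession

# Sketch: reuse or create a session, then build the same DataFrame
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(data, ['CustomerID', 'CustomerValue'])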
Defining a Window partitioned by CustomerID:
from pyspark.sql import Window
from pyspark.sql.functions import sum
w = Window.partitionBy('CustomerID')
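# Each partition contains all rows with the same CustomerID, so
# sum(df1.CustomerValue).over(w) attaches that customer's total to every row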
df2 = df1.withColumn('NormalizedCustomerValue', df1.CustomerValue/sum(df1.CustomerValue).over(w)).orderBy('CustomerID')
df2.show()
+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
| 12| 0.17| 1.0|
| 14| 0.15| 0.37499999999999994|
| 14| 0.25| 0.625|
| 17| 0.5| 0.5813953488372093|
| 17| 0.01| 0.011627906976744186|
| 17| 0.35| 0.4069767441860465|
+----------+-------------+-----------------------+

