Pyspark dataframe: Summing over a column while grouping over another

Warning: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33961899/

python, apache-spark-sql, pyspark, pyspark-sql, apache-spark-1.3

Asked by Paolo Lami

I have a dataframe such as the following

In [94]: prova_df.show()


order_item_order_id order_item_subtotal
1                   299.98             
2                   199.99             
2                   250.0              
2                   129.99             
4                   49.98              
4                   299.95             
4                   150.0              
4                   199.92             
5                   299.98             
5                   299.95             
5                   99.96              
5                   299.98             

What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:

from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

Which gives an output

SUM('order_item_subtotal)
129.99000549316406       
579.9500122070312        
199.9499969482422        
634.819995880127         
434.91000747680664 

I'm not so sure it's doing the right thing, though. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

Answered by zero323

Why isn't it showing also the information from the first column?

Most likely because you're using the outdated Spark 1.3.x. If that's the case, you have to repeat the grouping columns inside agg, as follows:

(df
    .groupBy("order_item_order_id")
    .agg(func.col("order_item_order_id"), func.sum("order_item_subtotal"))
    .show())
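
For what it's worth, on Spark 1.4 and later the grouping column is included in the aggregated result automatically, so a minimal sketch like the following (same df and func import as above; the order_total alias is purely illustrative) should be enough:

(df
    .groupBy("order_item_order_id")
    .agg(func.sum("order_item_subtotal").alias("order_total"))
    .show())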

Answered by luminousmen

You can use partitioning and a window function for that:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

df.withColumn("order_item_sum", f.sum("order_item_subtotal")
              .over(Window.partitionBy("order_item_order_id"))) \
  .show()
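
Unlike groupBy, the window approach keeps every original row and simply attaches the per-order sum to each of them. If you only want one row per order out of this result, a minimal sketch (reusing the imports above; order_item_sum is just the column name chosen in the snippet) could be:

(df.withColumn("order_item_sum",
               f.sum("order_item_subtotal").over(Window.partitionBy("order_item_order_id")))
   .select("order_item_order_id", "order_item_sum")
   .distinct()
   .show())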

Answered by Zac Roberts

A similar solution for your problem using PySpark 2.7.x would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
    (4, 299.95),
    (4, 150.0),
    (4, 199.92),
    (5, 299.98),
    (5, 299.95),
    (5, 99.96),
    (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
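
If the floating-point noise in sums such as 999.8700000000001 bothers you, one option is to round the aggregate and give it a friendlier name. A minimal sketch, assuming the same df as above and with order_total chosen purely for illustration:

from pyspark.sql import functions as func

(df.groupBy('order_item_order_id')
   .agg(func.round(func.sum('order_item_subtotal'), 2).alias('order_total'))
   .show())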