Pyspark dataframe: Summing over a column while grouping over another

Warning: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/33961899/

python, apache-spark-sql, pyspark, pyspark-sql, apache-spark-1.3

Asked by Paolo Lami

I have a dataframe such as the following

In [94]: prova_df.show()


order_item_order_id order_item_subtotal
1                   299.98             
2                   199.99             
2                   250.0              
2                   129.99             
4                   49.98              
4                   299.95             
4                   150.0              
4                   199.92             
5                   299.98             
5                   299.95             
5                   99.96              
5                   299.98             

What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:

from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

Which gives an output

SUM('order_item_subtotal)
129.99000549316406       
579.9500122070312        
199.9499969482422        
634.819995880127         
434.91000747680664 

I'm not so sure it's doing the right thing, though. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

Answered by zero323

Why isn't it showing also the information from the first column?

Most likely because you're using the outdated Spark 1.3.x. If that's the case, you have to repeat the grouping columns inside agg, as follows:

(df
    .groupBy("order_item_order_id")
    .agg(func.col("order_item_order_id"), func.sum("order_item_subtotal"))
    .show())
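
For what it's worth, on Spark 1.4 and later the grouping column is included in the aggregated result automatically, so a minimal sketch like the following (same df and func import as above; the order_total alias is purely illustrative) should be enough:

(df
    .groupBy("order_item_order_id")
    .agg(func.sum("order_item_subtotal").alias("order_total"))
    .show())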

Answered by luminousmen

You can use partitioning and a window function for that:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

df.withColumn("order_item_sum", f.sum("order_item_subtotal")
              .over(Window.partitionBy("order_item_order_id"))) \
  .show()
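
Unlike groupBy, the window approach keeps every original row and simply attaches the per-order sum to each of them. If you only want one row per order out of this result, a minimal sketch (reusing the imports above; order_item_sum is just the column name chosen in the snippet) could be:

(df.withColumn("order_item_sum",
               f.sum("order_item_subtotal").over(Window.partitionBy("order_item_order_id")))
   .select("order_item_order_id", "order_item_sum")
   .distinct()
   .show())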

Answered by Zac Roberts

A similar solution for your problem using PySpark 2.7.x would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
    (4, 299.95),
    (4, 150.0),
    (4, 199.92),
    (5, 299.98),
    (5, 299.95),
    (5, 99.96),
    (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
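
If the floating-point noise in sums such as 999.8700000000001 bothers you, one option is to round the aggregate and give it a friendlier name. A minimal sketch, assuming the same df as above and with order_total chosen purely for illustration:

from pyspark.sql import functions as func

(df.groupBy('order_item_order_id')
   .agg(func.round(func.sum('order_item_subtotal'), 2).alias('order_total'))
   .show())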