Scala: spark dataframe groupby multiple times
Original URL: http://stackoverflow.com/questions/41771327/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
spark dataframe groupby multiple times
Asked by Ramesh
// toDF requires import sqlContext.implicits._ (available by default in spark-shell)
val df = (Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
              (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")).
          toDF("col1", "col2", "col3"))
So I have a Spark dataframe with 3 columns:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+
My requirement is that I need to perform two levels of groupBy, as explained below.
Level 1: If I group by col1 and take the sum of col3, I get the two columns below: 1. col1 2. sum(col3). I lose col2 here.
Level 2: If I instead group by col1 and col2 and take the sum of col3, I get the three columns below: 1. col1 2. col2 3. sum(col3). (A sketch of both aggregations follows.)
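A minimal sketch of the two aggregations described above (level1/level2 are just illustrative names), assuming the df from the snippet at the top and org.apache.spark.sql.functions.sum in scope:

val level1 = df.groupBy("col1").agg(sum($"col3").as("sum_level1"))            // col2 is lost here
val level2 = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))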
My requirement is that I need to perform both levels of groupBy and have these two columns (sum(col3) of level 1, sum(col3) of level 2) in one final dataframe.
How can I do this? Can anyone explain?
Spark: 1.6.2, Scala: 2.10
Answered by Psidom
One option is to compute the two sums separately and then join them back together:
// sum here is org.apache.spark.sql.functions.sum; import it if it is not already in scope
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
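A side note: col3 is a string column in the sample data, and sum implicitly casts it to double, which is why the results come out as 23.0, 37.0, and so on. A sketch of making that cast explicit (level2 is just an illustrative name):

val level2 = df.groupBy("col1", "col2").agg(sum($"col3".cast("double")).as("sum_level2"))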
Another option is to use window functions, taking advantage of the fact that sum_level1 is just sum_level2 summed again within each col1 group:
import org.apache.spark.sql.expressions.Window

// sum_level1 = sum of sum_level2 within each col1 partition of the aggregated result
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
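A brief follow-up on choosing between the two: the window-based version aggregates df once and then computes sum_level1 over the much smaller aggregated result, while the join-based version aggregates df twice and adds a join. A sketch for comparing the physical plans yourself (viaJoin/viaWindow are just illustrative names, and w is the window defined above):

val viaJoin = (df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")))
val viaWindow = (df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)))
viaJoin.explain()
viaWindow.explain()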

