Scala: spark dataframe groupby multiple times
Original URL: http://stackoverflow.com/questions/41771327/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
spark dataframe groupby multiple times
Asked by Ramesh
// toDF requires import sqlContext.implicits._ (available by default in spark-shell)
val df = (Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
              (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")).
          toDF("col1", "col2", "col3"))
So I have a Spark dataframe with 3 columns:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+
My requirement is that I need to perform two levels of groupBy, as explained below.
Level 1: If I group by col1 and take the sum of col3, I get the two columns below: 1. col1 2. sum(col3). I lose col2 here.
Level 2: If I instead group by col1 and col2 and take the sum of col3, I get the three columns below: 1. col1 2. col2 3. sum(col3). (A sketch of both aggregations follows.)
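A minimal sketch of the two aggregations described above (level1/level2 are just illustrative names), assuming the df from the snippet at the top and org.apache.spark.sql.functions.sum in scope:

val level1 = df.groupBy("col1").agg(sum($"col3").as("sum_level1"))            // col2 is lost here
val level2 = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))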
My requirement is that I need to perform both levels of groupBy and have these two columns (sum(col3) of level 1, sum(col3) of level 2) in one final dataframe.
How can I do this? Can anyone explain?
Spark: 1.6.2, Scala: 2.10
Answered by Psidom
One option is to compute the two sums separately and then join them back together:
// sum here is org.apache.spark.sql.functions.sum; import it if it is not already in scope
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
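A side note: col3 is a string column in the sample data, and sum implicitly casts it to double, which is why the results come out as 23.0, 37.0, and so on. A sketch of making that cast explicit (level2 is just an illustrative name):

val level2 = df.groupBy("col1", "col2").agg(sum($"col3".cast("double")).as("sum_level2"))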
Another option is to use window functions, taking advantage of the fact that sum_level1 is just sum_level2 summed again within each col1 group:
import org.apache.spark.sql.expressions.Window

// sum_level1 = sum of sum_level2 within each col1 partition of the aggregated result
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
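A brief follow-up on choosing between the two: the window-based version aggregates df once and then computes sum_level1 over the much smaller aggregated result, while the join-based version aggregates df twice and adds a join. A sketch for comparing the physical plans yourself (viaJoin/viaWindow are just illustrative names, and w is the window defined above):

val viaJoin = (df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")))
val viaWindow = (df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)))
viaJoin.explain()
viaWindow.explain()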

