
Note: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/45131481/


How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

Tags: scala, apache-spark, apache-spark-sql, apache-spark-1.6

Asked by Dzmitry Haikov

In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC") with .over(Window.partitionBy("colA").orderBy("colB"))?


Answered by Ramesh Maharjan

Given that you have a dataframe such as


+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+
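
For reference, here is a minimal sketch of how such a dataframe could be built in Spark 1.6 (assuming an existing SparkContext named sc; in 1.6 the collect_list and collect_set functions are backed by Hive UDAFs, so a HiveContext is used here):

import org.apache.spark.sql.hive.HiveContext

// build a HiveContext from the existing SparkContext; in Spark 1.6 the
// collect_list / collect_set functions are only available with a HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// recreate the example data shown above
val df = Seq(
  (1, 1, 23),
  (1, 2, 63),
  (1, 3, 31),
  (2, 1, 32),
  (2, 2, 56)
).toDF("colA", "colB", "colC")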

You can apply Window functions by doing the following:


import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

Result:


+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+

The result is similar for collect_set as well, but the order of elements in the final set will not match the ordering you get with collect_list:


df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+
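
If you want the de-duplicated values in a predictable order, one option (not part of the original answer, just a sketch) is to wrap the collect_set in sort_array, which has been available since Spark 1.5:

// sort each collected set so the output order is deterministic (ascending)
df.withColumn("colD", sort_array(collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")))).show(false)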

If you remove the orderBy, as below,


df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)

the result would be:


+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
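
The difference comes from the default window frame: with an orderBy, the frame runs from the start of the partition up to the current row (which is why the lists grow row by row above), while without an orderBy it covers the whole partition. If you want both the colB ordering and the full partition on every row, you can set the frame explicitly. A sketch for Spark 1.6, where Long.MinValue and Long.MaxValue mark an unbounded frame in rowsBetween:

// keep the orderBy (so colC is collected in colB order) but widen the frame
// to the entire partition instead of the default "up to the current row"
val fullPartitionWindow = Window.partitionBy("colA").orderBy("colB").rowsBetween(Long.MinValue, Long.MaxValue)
df.withColumn("colD", collect_list("colC").over(fullPartitionWindow)).show(false)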

I hope the answer is helpful.
