scala 如何在 Spark 1.6 的窗口聚合中使用 collect_set 和 collect_list 函数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/45131481/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?
提问by Dzmitry Haikov
In Spark 1.6.0 / Scala, is there an opportunity to get collect_list("colC")or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")?
在 Spark 1.6.0 / Scala 中,是否有机会获得collect_list("colC")或collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")?
回答by Ramesh Maharjan
Given that you have dataframeas
既然你有dataframe作为
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
You can Windowfunctions by doing the following
您可以Window通过执行以下操作
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
结果:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
Similar is the result for collect_setas well. But the order of elements in the final setwill not be in order as with collect_list
也有类似的结果collect_set。但是最终元素的顺序set不会像collect_list
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
If you remove orderByas below
如果你删除orderBy如下
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
result would be
结果将是
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
I hope the answer is helpful
我希望答案有帮助

