How to count unique IDs after groupBy in PySpark

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/46421677/
Asked by Lizou
I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))
The result is:
(screenshot: students by year)
The problem I discovered is that many IDs are repeated, so the result is wrong and inflated.
I want to aggregate the students by year, count the total number of students per year, and avoid counting repeated IDs.
I hope the question is clear. I'm a new member. Thanks
Answered by pauli
Use the countDistinct function:
from pyspark.sql.functions import countDistinct

# Sample data: (year, student id) rows, with repeated ids
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# Count each id at most once per year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002| 2|
|2001| 2|
+----+------------------+
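
Note (an editorial addition, not part of the original answer): countDistinct returns a Column, so it can be aliased to match the question's total_student_by_year name, and on very large data approx_count_distinct trades a small counting error for speed. A minimal sketch, reusing the sample DataFrame y from above:

from pyspark.sql.functions import countDistinct, approx_count_distinct

# Exact distinct count, renamed to match the question's code
y.groupBy("year").agg(countDistinct("id").alias("total_student_by_year")).show()

# Approximate distinct count (HyperLogLog-based); useful when the data is huge
y.groupBy("year").agg(approx_count_distinct("id").alias("total_student_by_year")).show()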
Answered by information_interchange
You can also do:
# Using the sample DataFrame y from the first answer
y.groupBy("year", "id").count().groupBy("year").count()
This query will return the number of unique students per year.
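
Note (an editorial addition, not part of the original answer): an equivalent way to express this is to drop duplicate (year, id) pairs first and then count the rows that remain per year. A sketch, again using the sample DataFrame y from the first answer:

# Deduplicate (year, id) pairs, then count what remains per year
y.dropDuplicates(["year", "id"]).groupBy("year").count().show()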