Python: How to count unique IDs after groupBy in PySpark

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/46421677/


How to count unique IDs after groupBy in PySpark

Tags: python, pyspark, spark-dataframe, pyspark-sql

Asked by Lizou

I'm using the following code to aggregate students per year. The goal is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# count Student_ID rows per Year (duplicates included)
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is:

[students by year][1]

The problem I discovered is that many IDs are repeated, so the result is wrong and far too large.

I want to aggregate the students by year, count the total number of students per year, and avoid counting repeated IDs.

I hope the question is clear. I'm a new member. Thanks!

Answered by pauli

Use the countDistinct function:

from pyspark.sql.functions import countDistinct

# sample (year, id) data with repeated IDs
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# count each id at most once per year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
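
Applied to the asker's original code, the same idea would look like this (a sketch assuming the Df2, Year, and Student_ID names from the question; the alias preserves the column name used there):

from pyspark.sql.functions import countDistinct

# count each Student_ID at most once per Year
df_grouped = Df2.groupby('Year').agg(
    countDistinct('Student_ID').alias('total_student_by_year')
)

For very large datasets, pyspark.sql.functions.approx_count_distinct computes an approximate distinct count more cheaply.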

Answered by information_interchange

You can also do:

# using the example DataFrame y from the first answer:
# collapse duplicate (year, id) pairs first, then count rows per year
y.groupBy("year", "id").count().groupBy("year").count()

This query returns the number of unique students per year.
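
As a quick sanity check against the example DataFrame y from the first answer: the inner groupBy collapses duplicate (year, id) pairs, and the outer count then counts one row per distinct ID, so the result matches countDistinct (row order in show() may vary):

y.groupBy("year", "id").count().groupBy("year").count().show()

+----+-----+
|year|count|
+----+-----+
|2002|    2|
|2001|    2|
+----+-----+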