Python: How to count unique IDs after groupBy in PySpark

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/46421677/


How to count unique IDs after groupBy in PySpark

Tags: python, pyspark, spark-dataframe, pyspark-sql

Asked by Lizou

I'm using the following code to aggregate students per year. The goal is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# count Student_ID rows per Year (duplicates included)
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is:

[students by year][1]

The problem I discovered is that many IDs are repeated, so the result is wrong and far too large.

I want to aggregate the students by year, count the total number of students per year, and avoid counting repeated IDs.

I hope the question is clear. I'm a new member. Thanks!

Answered by pauli

Use the countDistinct function:

from pyspark.sql.functions import countDistinct

# sample (year, id) data with repeated IDs
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# count each id at most once per year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
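
Applied to the asker's original code, the same idea would look like this (a sketch assuming the Df2, Year, and Student_ID names from the question; the alias preserves the column name used there):

from pyspark.sql.functions import countDistinct

# count each Student_ID at most once per Year
df_grouped = Df2.groupby('Year').agg(
    countDistinct('Student_ID').alias('total_student_by_year')
)

For very large datasets, pyspark.sql.functions.approx_count_distinct computes an approximate distinct count more cheaply.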

Answered by information_interchange

You can also do:

# using the example DataFrame y from the first answer:
# collapse duplicate (year, id) pairs first, then count rows per year
y.groupBy("year", "id").count().groupBy("year").count()

This query returns the number of unique students per year.
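
As a quick sanity check against the example DataFrame y from the first answer: the inner groupBy collapses duplicate (year, id) pairs, and the outer count then counts one row per distinct ID, so the result matches countDistinct (row order in show() may vary):

y.groupBy("year", "id").count().groupBy("year").count().show()

+----+-----+
|year|count|
+----+-----+
|2002|    2|
|2001|    2|
+----+-----+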