Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/32550478/
PySpark: Take average of a column after using filter function
Asked by Harit Vishwakarma
I am using the following code to get the average age of people whose salary is greater than some threshold.
dataframe.filter(df['salary'] > 100000).agg({"avg": "age"})
The column age is numeric (float), but I am still getting this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o86.agg.
: scala.MatchError: age (of class java.lang.String)
Do you know any other way to obtain the avg etc. without using the groupBy function and SQL queries?
Accepted answer by zero323
The aggregation function should be the value and the column name the key:
dataframe.filter(df['salary'] > 100000).agg({"age": "avg"})
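To see why the original call failed: in the dictionary form of `agg`, each key is a column name and each value is the name of an aggregate function, so `{"avg": "age"}` asks Spark to apply a function called `age` to a column called `avg`. A plain-Python sketch of that key/value orientation (this is only an illustration of the convention, not Spark's implementation; the `agg` helper and `AGG_FUNCS` table here are made up for the example):

```python
# Illustrative sketch of the {column: function} orientation that agg() expects.
# AGG_FUNCS maps aggregate-function names to implementations (hypothetical).
AGG_FUNCS = {"avg": lambda xs: sum(xs) / len(xs), "max": max, "min": min}

def agg(rows, spec):
    """spec maps column name -> aggregate function name, as in df.agg(spec)."""
    out = {}
    for column, func_name in spec.items():
        values = [r[column] for r in rows]
        out[f"{func_name}({column})"] = AGG_FUNCS[func_name](values)
    return out

rows = [{"age": 30.0}, {"age": 50.0}]
print(agg(rows, {"age": "avg"}))  # {'avg(age)': 40.0}
# agg(rows, {"avg": "age"}) would fail: there is no column 'avg'
# and no aggregate function named 'age' -- the MatchError in the question.
```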
Alternatively you can use pyspark.sql.functions:
from pyspark.sql.functions import col, avg
dataframe.filter(df['salary'] > 100000).agg(avg(col("age")))
It is also possible to use CASE .. WHEN:
from pyspark.sql.functions import avg, when
dataframe.select(avg(when(df['salary'] > 100000, df['age'])))
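This version needs no `filter` because `when(cond, value)` without an `otherwise` yields NULL for non-matching rows, and Spark's `avg` skips NULLs. A pure-Python sketch of that semantics (illustrative only; the toy rows below are made up):

```python
# Sketch of what avg(when(salary > 100000, age)) computes:
# non-matching rows become None (Spark NULL), and the average skips them.
rows = [
    {"age": 30.0, "salary": 120000},
    {"age": 40.0, "salary": 90000},
    {"age": 50.0, "salary": 150000},
]

# when(df['salary'] > 100000, df['age']) -> age where matched, else None
conditional = [r["age"] if r["salary"] > 100000 else None for r in rows]

# avg() ignores NULLs, so only the matching ages contribute
matched = [a for a in conditional if a is not None]
avg_age = sum(matched) / len(matched)
print(avg_age)  # 40.0 -- same as filtering first, then averaging
```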
Answered by Ahmed Gehad
You can try this too:
dataframe.filter(df['salary'] > 100000).groupBy().avg('age')