Disclaimer: this page is an English/Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/32550478/


PySpark: Take average of a column after using filter function

python · apache-spark · apache-spark-sql · pyspark · pyspark-sql

Asked by Harit Vishwakarma

I am using the following code to get the average age of people whose salary is greater than some threshold.


dataframe.filter(df['salary'] > 100000).agg({"avg": "age"})

The age column is numeric (float), but I am still getting this error:


py4j.protocol.Py4JJavaError: An error occurred while calling o86.agg. 
: scala.MatchError: age (of class java.lang.String)

Do you know any other way to obtain the average, etc., without using the groupBy function and SQL queries?


Accepted answer by zero323

In the aggregation dictionary, the column name should be the key and the aggregation function the value; the original call has them swapped, which is why Spark tries to interpret "age" as a function name:


dataframe.filter(df['salary'] > 100000).agg({"age": "avg"})

Alternatively, you can use pyspark.sql.functions:


from pyspark.sql.functions import col, avg

dataframe.filter(df['salary'] > 100000).agg(avg(col("age")))

It is also possible to use CASE ... WHEN:


from pyspark.sql.functions import avg, when

dataframe.select(avg(when(df['salary'] > 100000, df['age'])))

Answered by Ahmed Gehad

You can try this too:


dataframe.filter(df['salary'] > 100000).groupBy().avg('age')