Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/32550478/
PySpark: Take average of a column after using filter function
Asked by Harit Vishwakarma
I am using the following code to get the average age of people whose salary is greater than some threshold.
dataframe.filter(df['salary'] > 100000).agg({"avg": "age"})
The column age is numeric (float), but I am still getting this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o86.agg.
: scala.MatchError: age (of class java.lang.String)
Do you know any other way to obtain the avg etc. without using the groupBy function and SQL queries?
Accepted answer by zero323
The aggregation function should be the value and the column name the key:
dataframe.filter(df['salary'] > 100000).agg({"age": "avg"})
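To see why the original call failed: in the dictionary form of `agg`, each key is a column name and each value is the name of an aggregate function, so `{"avg": "age"}` asks Spark to apply a function called `age` to a column called `avg`. A plain-Python sketch of that key/value orientation (this is only an illustration of the convention, not Spark's implementation; the `agg` helper and `AGG_FUNCS` table here are made up for the example):

```python
# Illustrative sketch of the {column: function} orientation that agg() expects.
# AGG_FUNCS maps aggregate-function names to implementations (hypothetical).
AGG_FUNCS = {"avg": lambda xs: sum(xs) / len(xs), "max": max, "min": min}

def agg(rows, spec):
    """spec maps column name -> aggregate function name, as in df.agg(spec)."""
    out = {}
    for column, func_name in spec.items():
        values = [r[column] for r in rows]
        out[f"{func_name}({column})"] = AGG_FUNCS[func_name](values)
    return out

rows = [{"age": 30.0}, {"age": 50.0}]
print(agg(rows, {"age": "avg"}))  # {'avg(age)': 40.0}
# agg(rows, {"avg": "age"}) would fail: there is no column 'avg'
# and no aggregate function named 'age' -- the MatchError in the question.
```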
Alternatively you can use pyspark.sql.functions:
from pyspark.sql.functions import col, avg
dataframe.filter(df['salary'] > 100000).agg(avg(col("age")))
It is also possible to use CASE .. WHEN:
from pyspark.sql.functions import avg, when
dataframe.select(avg(when(df['salary'] > 100000, df['age'])))
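This version needs no `filter` because `when(cond, value)` without an `otherwise` yields NULL for non-matching rows, and Spark's `avg` skips NULLs. A pure-Python sketch of that semantics (illustrative only; the toy rows below are made up):

```python
# Sketch of what avg(when(salary > 100000, age)) computes:
# non-matching rows become None (Spark NULL), and the average skips them.
rows = [
    {"age": 30.0, "salary": 120000},
    {"age": 40.0, "salary": 90000},
    {"age": 50.0, "salary": 150000},
]

# when(df['salary'] > 100000, df['age']) -> age where matched, else None
conditional = [r["age"] if r["salary"] > 100000 else None for r in rows]

# avg() ignores NULLs, so only the matching ages contribute
matched = [a for a in conditional if a is not None]
avg_age = sum(matched) / len(matched)
print(avg_age)  # 40.0 -- same as filtering first, then averaging
```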
Answered by Ahmed Gehad
You can try this too:
dataframe.filter(df['salary'] > 100000).groupBy().avg('age')