Python PySpark - Aggregation on multiple columns
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original: http://stackoverflow.com/questions/36251004/
Pyspark - Aggregation on multiple columns
Asked by Mohan
I have data like below. Filename: babynames.csv.
year name percent sex
1880 John 0.081541 boy
1880 William 0.080511 boy
1880 James 0.050057 boy
I need to sort the input by year and sex, and I want the output aggregated as below (this output is to be assigned to a new RDD).
year sex avg(percentage) count(rows)
1880 boy 0.070703 3
I am not sure how to proceed after the following step in PySpark. I need your help with this:
testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
aggregatedoutput = ????
Answered by zero323
- Follow the instructions from the README to include the spark-csv package.
- Load the data:

  df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(inferSchema="true", delimiter=";", header="true")
      .load("babynames.csv"))

- Import the required functions:

  from pyspark.sql.functions import count, avg

- Group by and aggregate (optionally use Column.alias):

  df.groupBy("year", "sex").agg(avg("percent"), count("*"))
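To give the aggregated columns readable names, Column.alias can be applied to each aggregate. A minimal sketch, assuming the df loaded above; the alias strings here simply mirror the headers in the desired output and are otherwise arbitrary:

  from pyspark.sql.functions import avg, count

  # alias each aggregate so the result columns match the desired headers
  aggregatedoutput = (df.groupBy("year", "sex")
      .agg(avg("percent").alias("avg(percentage)"),
           count("*").alias("count(rows)")))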
Alternatively:
- cast percent to numeric
- reshape to the format ((year, sex), percent)
- aggregateByKey using pyspark.statcounter.StatCounter
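A minimal sketch of this RDD-based alternative, assuming the rows RDD from the question, with records laid out as [year, name, percent, sex]:

  from pyspark.statcounter import StatCounter

  # reshape to ((year, sex), percent), casting percent to float
  pairs = rows.map(lambda r: ((r[0], r[3]), float(r[2])))

  # StatCounter accumulates count, mean, min, max, etc. in a single pass
  stats = pairs.aggregateByKey(
      StatCounter(),
      lambda acc, value: acc.merge(value),   # fold one value into the accumulator
      lambda a, b: a.mergeStats(b))          # combine per-partition accumulators

  # extract (year, sex, avg(percentage), count(rows)) per group
  result = stats.map(lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count()))

Because StatCounter computes all of its statistics in one pass over the data, this avoids a separate reduce for the average and the count.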