Python PySpark - Aggregation on multiple columns

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36251004/


Pyspark - Aggregation on multiple columns

python, python-2.7, apache-spark, pyspark

Asked by Mohan

I have data like below. Filename: babynames.csv.

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to sort the input based on year and sex, and I want the output aggregated as below (this output is to be assigned to a new RDD).

year    sex   avg(percent)   count(rows)
1880    boy   0.070703       3

I am not sure how to proceed after the following step in PySpark. I need your help on this.

testrdd = sc.textFile("babynames.csv")
# split each line on commas and drop the header row
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
aggregatedoutput = ????

Answered by zero323

  1. Follow the instructions from the README to include the spark-csv package
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        # the sample file is comma-separated
        .options(inferSchema="true", delimiter=",", header="true")
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally using Column.alias):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
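
For example, a minimal sketch with aliased output columns (the column names avg_percent and row_count are my own choice, not part of the original answer):

    from pyspark.sql.functions import avg, count

    aggregatedoutput = (df
        .groupBy("year", "sex")
        .agg(avg("percent").alias("avg_percent"),
             count("*").alias("row_count")))
    aggregatedoutput.show()

On the three sample rows this yields avg_percent 0.070703 and row_count 3 for (1880, boy), matching the expected output above.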
    

Alternatively:

  • cast percent to numeric
  • reshape the data to the format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter (see the sketch below)
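
A minimal sketch of this RDD-based alternative, building on the rows RDD from the question (the intermediate variable names are my own):

    from pyspark.statcounter import StatCounter

    # ((year, sex), percent) pairs, with percent cast to a float
    pairs = rows.map(lambda x: ((x[0], x[3]), float(x[2])))

    # fold every percent value for a key into a single StatCounter
    stats = pairs.aggregateByKey(
        StatCounter(),
        lambda acc, value: acc.merge(value),
        lambda acc1, acc2: acc1.mergeStats(acc2))

    # pull out the mean and the row count per (year, sex) key
    aggregatedoutput = stats.mapValues(lambda s: (s.mean(), s.count()))

Because StatCounter tracks count, sum, and mean together, both aggregates come out of a single pass over the data.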