
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/39504950/

Date: 2020-08-19 22:21:21  Source: igfitidea

python, pyspark : get sum of a pyspark dataframe column values

python, pyspark, pyspark-sql

Asked by Satya

Say I have a dataframe like this:


name age city
abc   20  A
def   30  B

I want to add a summary row at the end of the dataframe, so the result will look like:


name age city
abc   20  A
def   30  B
All   50  All

The string 'All' I can easily put in, but how do I get sum(df['age'])? It fails with "Column object is not iterable".


data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")], ["name", "age", "city"])
data.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All', sum(data['age']), 'All')], data.columns))  # TypeError: Column is not iterable
# Also tried data['age'].sum() and got an error. Using a literal [('All', 50, 'All')] works fine.

I usually work with Pandas dataframes and am new to Spark, so my understanding of Spark dataframes may not be that mature yet.


Please suggest how to get the sum over a dataframe column in PySpark, and whether there is a better way to add/append a row to the end of a dataframe. Thanks.


Answer by swenzel

Spark SQL has a dedicated module for column functions, pyspark.sql.functions.
It works like this:


from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

res = data.unionAll(  # in Spark 2.0+, unionAll is a deprecated alias of union
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()

Prints:


+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+

Answer by GwydionFR

A dataframe is immutable; you need to create a new one. To get the sum of the age column, you can use this function: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x + y)


The way you add a row is fine, but why would you do such a thing? Your dataframe will be hard to manipulate, and you won't be able to use aggregation functions unless you drop the last row.
