Disclaimer: this page is an English/Chinese translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39504950/
python, pyspark : get sum of a pyspark dataframe column values
Asked by Satya
Say I have a dataframe like this:
name age city
abc 20 A
def 30 B
I want to add a summary row at the end of the dataframe, so the result will look like:
name age city
abc 20 A
def 30 B
All 50 All
The string 'All' I can easily put in place, but how do I get sum(df['age'])? ### Column object is not iterable
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
#|-- name: string (nullable = true)
#|-- age: long (nullable = true)
#|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns)) ## TypeError: Column is not iterable
# Even tried data['age'].sum() and got an error. If I use [('All', 50, 'All')], it works fine.
I usually work with Pandas dataframes and am new to Spark. My understanding of Spark dataframes may not be that mature yet.
Please suggest how to get the sum over a dataframe column in pyspark, and whether there is a better way to add/append a row to the end of a dataframe. Thanks.
Answered by swenzel
Spark SQL has a dedicated module for column functions, pyspark.sql.functions.
So the way it works is:
from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
res = data.unionAll(
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()
Prints:
+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20| A|
| def| 30| B|
| All| 50| All|
+----+---+----+
Answered by GwydionFR
A dataframe is immutable, so you need to create a new one. To get the sum of the age column, you can use: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x + y)
The way you add a row is fine, but why would you do such a thing? Your dataframe will be hard to manipulate, and you won't be able to use aggregation functions unless you drop the last row.